My Blog Posts, in Reverse Chronological Order

Subscribe via RSS or by signing up with your email here.

My PhD Dissertation Talk

May 23, 2021

The long wait is over. After many years, I am excited to share that I delivered my PhD dissertation talk. I gave it on May 13, 2021 via Zoom. I recorded the 45-minute talk and will release the video in June, to give me a chance to push for another paper submission this summer from the content at the end of the talk without worrying about getting scooped.

I had multiple opportunities to practice the PhD talk, as I gave several earlier talks with a substantial amount of overlap, such as the one “at” Toronto in March (see the blog post here). My PhD talk, like prior talks, heavily focuses on robot manipulation of deformables, and includes discussions of my IROS 2020, RSS 2020, and ICRA 2021 papers. However, I wanted the focus to be broader than deformable manipulation alone, so I structured the talk to feature “robot learning” prominently, of which “deformable manipulation” is one particular example. Then, rather than go through the “Model-Free,” “Model-Based,” and “Transporter Network” sections from my prior talks, I chose to title the talk sections as follows: “Simulated Interactions,” “Architectural Priors,” and “Curricula.” This also gave me the chance to feature some of my curriculum learning work with John Canny.

The audience had some questions at the end, but overall, the questions were generally not too difficult to answer. Perhaps in years past, it was typical to have very challenging questions at the end of a dissertation talk, and students may have failed if they couldn’t answer well enough. Nowadays, every Berkeley EECS PhD student who gives a dissertation talk is expected to pass. I’m not aware of anyone failing after giving the talk.

I want to thank everyone who helped me get to this point today, especially when earlier in my PhD, I thought I would never reach this point. Or at the very least, I thought I would not have as strong a research record as I now have. A proper and more detailed set of acknowledgments will come at a later date.

I am not a “Doctor” yet, since I still need to write up the actual dissertation itself, which I will do this summer by “stitching” together my 4-5 most relevant first-author papers. Nonetheless, giving this talk is a huge step forward in finishing up my PhD, and I am hugely relieved that it’s out of the way.

I will also be starting a postdoc position in a few months. More on that to come later …










Inverse Reinforcement Learning from Preferences

Apr 1, 2021

It’s been a long time since I engaged in a detailed read through of an inverse reinforcement learning (IRL) paper. The idea is that, rather than the standard reinforcement learning problem where an agent explores to get samples and finds a policy to maximize the expected sum of discounted rewards, we are instead given data already, and must determine the reward function. After this reward function is learned, one can then learn a new policy based on it by running standard reinforcement learning, but where the reward for each state (or state-action) is determined from the learned reward function. As a side note, since this appears to be quite common and “part of” IRL, I’m not sure why IRL is often classified as an “imitation learning” algorithm when reinforcement learning has to be run as a subroutine. Keep this in mind when reading papers on imitation learning, which often categorize algorithms as supervised learning (e.g., behavioral cloning) approaches vs IRL approaches, such as in the introduction of the famous Generative Adversarial Imitation Learning paper.

In the rest of this post, we’ll cover two closely-related works on IRL that cleverly and effectively rely on preference rankings among trajectories. They also have similar acronyms: T-REX and D-REX. The T-REX paper presents the Trajectory-ranked Reward Extrapolation algorithm, which is also used in the D-REX paper (Disturbance-based Reward Extrapolation). So we shall first discuss how reward extrapolation works in T-REX, and then we will clarify the difference between the two papers.

T-REX and D-REX

The motivation for T-REX is that in IRL, most approaches rely on defining a reward function which explains the demonstrator data and makes it appear optimal. But, what if we have suboptimal demonstrator data? Then, rather than fit a reward function to this data, it may be better to instead figure out the appropriate features of the data that convey information about the underlying intentions of the demonstrator, which may be extrapolated beyond the data. T-REX does this by working with a set of demonstrations which are ranked.

To be concrete, denote a sequence of $m$ ranked trajectories:

\[\mathcal{D} = \{ \tau_1, \ldots, \tau_m \}\]

where if $i<j$, then $\tau_i \prec \tau_j$, or in other words, trajectory $\tau_i$ is worse than $\tau_j$. We’ll assume that each $\tau_i$ consists of a series of states, so that neither demonstrator actions nor the reward are needed (a huge plus!):

\[\tau_i = (s_0^{(i)}, s_1^{(i)}, \ldots, s_T^{(i)})\]

and we can also assume that the trajectory lengths are all the same; this isn’t a strict requirement of T-REX (since we can normalize based on length), but it probably makes training more numerically stable.

From this data $\mathcal{D}$, T-REX will train a learned reward function $\hat{R}_\theta(s)$ such that:

\[\sum_{s \in \tau_i} \hat{R}_\theta(s) < \sum_{s \in \tau_j} \hat{R}_\theta(s) \quad \mbox{if} \quad \tau_i \prec \tau_j\]

To be clear, in the above equation there is no true environment reward at all. It’s just the learned reward function $\hat{R}_\theta$, along with the trajectory rankings. That’s it! One may, of course, use the true reward function to determine the rankings in the first place, but that is not required, and that’s a key flexibility advantage for T-REX – there are many other ways we can rank trajectories.
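To make the ordering constraint concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the “reward network” is just a toy linear function of a 3-dimensional state, and the trajectories are random arrays, not data from either paper.

```python
import numpy as np

# Toy stand-in for the learned reward R_theta; a real implementation
# would use a neural network, but a linear function shows the idea.
def reward_net(state, theta=np.array([0.5, -0.2, 1.0])):
    return float(theta @ state)

def trajectory_return(traj):
    """Sum the learned reward over every state in a trajectory."""
    return sum(reward_net(s) for s in traj)

rng = np.random.default_rng(0)
tau_worse = rng.normal(0.0, 1.0, size=(10, 3))  # lower-ranked trajectory
tau_better = tau_worse + 0.5                    # shifted copy, higher-ranked

# T-REX trains R_theta so that comparisons like this respect the ranking:
print(trajectory_return(tau_worse) < trajectory_return(tau_better))  # -> True
```

Training adjusts the reward parameters until such comparisons hold across the ranked pairs in $\mathcal{D}$; here the toy parameters happen to satisfy the constraint already.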

In order to train $\hat{R}_\theta$ so the above criterion is satisfied, we can use the cross entropy loss function. Most people probably start using the cross-entropy loss function in the context of classification tasks, where the neural network outputs some “logits” and the loss function tries to “get” the logits to match a true one-hot vector distribution. The logic here is similar. The output of the reward network forms the (un-normalized) probability that one trajectory is preferable to another:

\[P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)) \approx \frac{\exp \sum_{s \in \tau_j} \hat{R}_\theta(s) }{ \exp \sum_{s \in \tau_i}\hat{R}_\theta(s) + \exp \sum_{s \in \tau_j}\hat{R}_\theta(s) }\]

which we then use in this loss function:

\[\mathcal{L}(\theta) = - \sum_{\tau_i \prec \tau_j } \log \left( \frac{\exp \sum_{s \in \tau_j} \hat{R}_\theta(s) }{\exp \sum_{s \in \tau_i} \hat{R}_\theta(s)+ \exp \sum_{s \in \tau_j}\hat{R}_\theta(s) } \right)\]

Let’s deconstruct what we’re looking at here. The loss function $\mathcal{L}(\theta)$ for training $\hat{R}_\theta$ is binary cross entropy, where the two “classes” involved here are whether $\tau_i \succ \tau_j$ or $\tau_i \prec \tau_j$. (We can easily extend this to include cases when the two are equal, but let’s ignore that for now.) Above, the true class corresponds to $\tau_i \prec \tau_j$.

If this isn’t clear, it helps to review the cross entropy (e.g., from this source): between a true distribution “$p$” and a predicted distribution “$q$”, it is defined as $-\sum_x p(x) \log q(x)$, where the sum over $x$ iterates through all possible classes – in this case we only have two. The true distribution is $p=[0,1]$ if we interpret the two components as expressing the class $\tau_i \succ \tau_j$ at index 0, or $\tau_i \prec \tau_j$ at index 1. In all cases, the true class is at index 1 by design. The predicted distribution comes from the output of the reward function network:

\[q = \Big[1 - P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)), \; P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)) \Big]\]

and putting this together, the cross entropy term reduces to $\mathcal{L}(\theta)$ as shown above, for a single training data point (i.e., a single training pair $(\tau_i, \tau_j)$). We would then sample many of these pairs during training for each minibatch.
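As a sanity check of the loss for a single pair, here is a hedged numpy sketch. The two returns are made-up scalars standing in for $\sum_{s \in \tau} \hat{R}_\theta(s)$; a real implementation would backpropagate this loss through the reward network.

```python
import numpy as np

def trex_pair_loss(ret_i, ret_j):
    """-log P(tau_i < tau_j): binary cross entropy where the true class
    is "tau_j is preferred", computed with log-sum-exp for stability."""
    m = max(ret_i, ret_j)
    log_denom = m + np.log(np.exp(ret_i - m) + np.exp(ret_j - m))
    return -(ret_j - log_denom)

# Small loss when the higher-ranked trajectory already gets the larger
# predicted return, large loss when the ordering is violated:
print(trex_pair_loss(ret_i=1.0, ret_j=5.0))  # small (about 0.018)
print(trex_pair_loss(ret_i=5.0, ret_j=1.0))  # large (about 4.018)
```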

To get this to work in cases when the two trajectories are ambiguous (i.e., neither is clearly preferred), you can set the “target” distribution to be $[0.5, 0.5]$. This is made explicit in this NeurIPS 2018 paper from DeepMind which uses the same loss function.

The main takeaway is that this process will learn a reward function assigning greater total return to higher ranked trajectories. As long as there are features associated with higher return that are identifiable from the data, then it may be possible to extrapolate beyond the data.

Once the reward function is learned, T-REX then runs policy optimization by running reinforcement learning, which in both papers here is Proximal Policy Optimization. This is done in an online fashion, but where instead of data coming in as $(s,a,r,s')$ tuples, they will be $(s,a,\hat{R}_\theta(s),s')$, where the reward comes from the learned reward function.
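The relabeling step can be sketched as follows; this is a hypothetical illustration (the reward function and transitions are toys), not code from either paper:

```python
# Stand-in for the learned reward R_theta; here, "prefer states near zero."
def learned_reward(state):
    return -abs(state)

def relabel(transitions):
    """Replace the environment reward r with R_theta(s) in each tuple,
    so the downstream RL algorithm (e.g., PPO) never sees true rewards."""
    return [(s, a, learned_reward(s), s_next)
            for (s, a, r, s_next) in transitions]

data = [(0.5, 1, 99.0, 0.4), (2.0, 0, 99.0, 1.9)]
print(relabel(data))  # -> [(0.5, 1, -0.5, 0.4), (2.0, 0, -2.0, 1.9)]
```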

This makes sense, but as usual, there are a bunch of practical tips and tricks to get things working. Here are some for T-REX:

  • For many environments, “trajectories” often refer to “episodes”, but these can last for a large number of time steps. To perform data augmentation, one can subsample snippets of the same length from pairs of trajectories $\tau_i$ and $\tau_j$.

  • Training an ensemble of reward functions for $\hat{R}_\theta$ often helps, provided the individual components have values at roughly the same scale.

  • The reward used for the policy optimization stage might need some extra “massaging” to it. For example, with MuJoCo, the authors use a control penalty term that gets added to $\hat{R}_\theta(s)$.

  • To check if reward extrapolation is feasible, one can plot a graph that shows ground truth returns on the x-axis and predicted returns on the y-axis. If there is strong correlation between the two, then that’s a sign extrapolation is more likely to happen.
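The first tip above, subsampling equal-length snippets, can be sketched like this; the snippet length and the uniform sampling of start indices are my own illustrative choices, not necessarily the papers’ exact scheme:

```python
import numpy as np

def subsample_pair(tau_i, tau_j, snippet_len, rng):
    """Draw one contiguous snippet of length snippet_len from each
    trajectory, turning a single ranked pair into many training pairs."""
    start_i = rng.integers(0, len(tau_i) - snippet_len + 1)
    start_j = rng.integers(0, len(tau_j) - snippet_len + 1)
    return (tau_i[start_i:start_i + snippet_len],
            tau_j[start_j:start_j + snippet_len])

rng = np.random.default_rng(0)
tau_i = np.arange(100)       # toy "states" for two full episodes
tau_j = np.arange(100, 250)
snip_i, snip_j = subsample_pair(tau_i, tau_j, snippet_len=20, rng=rng)
print(len(snip_i), len(snip_j))  # -> 20 20
```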

In both T-REX and D-REX, the authors experiment with discrete control and continuous control using standard environments from Atari and MuJoCo, respectively, and find that overall, their two stage approach of (1) finding $\hat{R}_\theta$ from preferences and (2) running PPO on top of this learned reward function, works better than competing baselines such as Behavior Cloning and Generative Adversarial Imitation Learning, and that they can exceed the performance of the demonstration data.

The above is common to both T-REX and D-REX. So what’s the difference between the two papers?

  • T-REX assumes that we have rankings available ahead of time. This can be from a number of sources. Maybe they were “ground truth” rankings based on ground truth rewards (i.e., just sum up the true reward within the $\tau_i$s), or they might be noisy rankings. An easy way to test noisy rankings is to rank trajectories based on the time in training history if we extract trajectories from an RL agent’s history. Another, but more cumbersome way (since it relies on human subjects) is to use Amazon Mechanical Turk. The T-REX paper does a splendid job testing these different rankings – it’s one reason I really like the paper.

  • In contrast, D-REX assumes these rankings are not available ahead of time. Instead, the approach involves training a policy from the provided demonstration data via Behavior Cloning, then taking that resulting snapshot and rolling it out in the environment with different noise levels. This naturally provides a ranking for the data, and only relies on the weak assumption that the Behavior Cloning agent will be better than a purely random policy. Then with these automatic rankings, D-REX can just do exactly what T-REX did!

  • D-REX makes a second contribution on the theoretical side to better understand why preferences over demonstrations can reduce reward function ambiguity in IRL.
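The D-REX ranking trick can be sketched as follows. Everything here is a toy stand-in: `bc_action` plays the role of the behavior-cloned policy, and the one-dimensional dynamics are invented purely to make the example runnable.

```python
import numpy as np

def bc_action(state):
    """Toy stand-in for a behavior-cloned policy: push the state to zero."""
    return -state

def rollout(noise_level, horizon, rng):
    """Roll out the cloned policy with Gaussian action noise injected."""
    s, states = 1.0, []
    for _ in range(horizon):
        a = bc_action(s) + rng.normal(0.0, noise_level)
        s = s + 0.5 * a  # toy dynamics
        states.append(s)
    return states

rng = np.random.default_rng(0)
noise_levels = [1.0, 0.6, 0.3, 0.0]  # most noise first
# More noise presumably means worse behavior, so ordering rollouts from
# most to least noise yields ranked trajectories with no human labels
# and no ground-truth reward needed:
ranked = [rollout(nl, horizon=50, rng=rng) for nl in noise_levels]
print(len(ranked))  # -> 4 trajectories, ranked worst to best
```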

Some Theory in D-REX

Here’s a little more on the theory from D-REX. We’ll follow the notation from the paper and state Theorem 1 here (see the paper for context):

If the estimated reward function is $\;\hat{R}(s) = w^T\phi(s),\;$ the true reward function is \(\;R^*(s) = \hat{R}(s) + \epsilon(s)\;\) for some error function \(\;\epsilon : \mathcal{S} \to \mathbb{R}\;\) and \(\;\|w\|_1 \le 1,\;\) then extrapolation beyond the demonstrator, i.e., \(\; J(\hat{\pi}|R^*) > J(\mathcal{D}|R^*),\;\) is guaranteed if:

\[J(\pi_{R^*}^*|R^*) - J(\mathcal{D}|R^*) > \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma}\]

where \(\;\pi_{R^*}^* \;\) is the optimal policy under $R^*$, \(\;\epsilon_\Phi = \| \Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}\|_\infty,\;\) and \(\|\epsilon\|_\infty = {\rm sup}\{ | \epsilon(s)| : s \in \mathcal{S} \}\).

To clarify the theorem, $\hat{\pi}$ is some learned policy for which we want to outperform the average episodic return in the demonstration data $J(\mathcal{D}|R^*)$. We begin by considering the difference in return between the optimal policy under the true reward (which can’t be exceeded w.r.t. that reward by definition) and the expected return of the learned policy (also under that true reward):

\[\begin{align} J(\pi_{R^*}^*|R^*) - J(\hat{\pi}|R^*) \;&{\overset{(i)}=}\;\; \left| \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t R^*(s) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t R^*(s) \Big] \right| \\ \;&{\overset{(ii)}=}\;\; \left| \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t (w^T\phi(s_t)+\epsilon(s_t)) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t (w^T\phi(s_t)+\epsilon(s_t)) \Big] \right| \\ \;&{\overset{(iii)}=}\; \left| w^T\Phi_{\pi_{R^*}^*} + \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t \epsilon(s_t) \Big] - w^T\Phi_{\hat{\pi}} - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t \epsilon(s_t) \Big] \right| \\ \;&{\overset{(iv)}\le}\;\; \left| w^T(\Phi_{\pi_{R^*}^*} -\Phi_{\hat{\pi}}) + \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t \sup_{s\in \mathcal{S}} \epsilon(s) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t \inf_{s \in \mathcal{S}} \epsilon(s) \Big] \right| \\ \;&{\overset{(v)}=}\;\; \left| w^T(\Phi_{\pi_{R^*}^*} -\Phi_{\hat{\pi}}) + \Big( \sup_{s\in \mathcal{S}} \epsilon(s) - \inf_{s \in \mathcal{S}} \epsilon(s) \Big) \sum_{t=0}^{\infty} \gamma^t \right| \\ \;&{\overset{(vi)}\le}\;\; \left| w^T(\Phi_{\pi_{R^*}^*} -\Phi_{\hat{\pi}}) + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \right| \\ \;&{\overset{(vii)}\le}\;\; \left| w^T(\Phi_{\pi_{R^*}^*} -\Phi_{\hat{\pi}})\right| + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \\ \;&{\overset{(viii)}\le}\; \|w\|_1 \|\Phi_{\pi_{R^*}^*} -\Phi_{\hat{\pi}})\|_\infty + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \\ &{\overset{(ix)}\le}\; \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} \end{align}\]

where

  • in (i), we apply the definition of the terms and put absolute values around them. I don’t think this is necessary since the LHS must be nonnegative, but it doesn’t hurt.

  • in (ii), we substitute $R^*$ with the theorem’s assumption about both the error function and how the estimated reward is a linear combination of features.

  • in (iii) we move the weights $w$ outside the expectation as they are constants and we can use linearity of expectation. Then we use the paper’s definition of $\Phi_\pi$ as the expected feature counts for given policy $\pi$.

  • in (iv) we move the two $\Phi$ terms together (notice how this matches the theorem’s $\epsilon_\Phi$ definition), and we then make this an inequality by looking at the expectations and applying “sup”s and “inf”s to each time step. This is saying if we have $A-B$ then let’s make the $A$ term larger and the $B$ term smaller. Since we’re doing this for an infinite number of time steps, I am somewhat worried that this is a loose bound.

  • in (v) we see that since the “sup” and “inf” terms no longer depend on $t$, we can move them outside the expectations. In fact, we don’t even need expectations anymore, since all that’s left is a sum over discounted $\gamma$ terms.

  • in (vi) we apply the geometric series formula to get rid of the sum over $\gamma$ and then the inequality results from replacing the “sup”s and “inf”s with the \(\| \epsilon \|_\infty\) from the theorem statement – the “2” helps to cover the extremes of a large positive error and a large negative error (note the absolute value in the theorem condition, that’s important).

  • in (vii) we apply the Triangle Inequality.

  • in (viii) we apply Hölder’s inequality.

  • finally, in (ix) we apply the theorem statements.

We now take that final inequality and subtract the average demonstration data return on both sides:

\[\underbrace{J(\pi_{R^*}^*|R^*)- J(\mathcal{D}|R^*)}_{\delta} - J(\hat{\pi}|R^*) \le \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\mathcal{D}|R^*)\]

Now we finally invoke the “if” condition in the theorem. If the inequality in the theorem holds, then $\delta$ is strictly larger than $\epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma}$, so replacing $\delta$ above with that quantity strictly reduces the LHS and makes the inequality strict:

\[\epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\hat{\pi}|R^*) < \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\mathcal{D}|R^*)\]

which implies:

\[- J(\hat{\pi}|R^*) < - J(\mathcal{D}|R^*) \quad \Longrightarrow \quad J(\hat{\pi}|R^*) > J(\mathcal{D}|R^*),\]

showing that $\hat{\pi}$ has extrapolated beyond the data.

What’s the intuition behind the theorem? The LHS of the theorem shows the difference in return between the optimal policy and the demonstration data. By definition of optimality, the LHS is at least 0, but it can get very close to 0 if the demonstration data is very good. That’s not good for extrapolation, and hence the condition for outperforming the demonstrator is less likely to hold (which makes sense). Focusing on the RHS, we see that its value is larger if the maximum error in $\epsilon$ is large. This might be a very restrictive condition, since it considers the maximum absolute error over the entire state set $\mathcal{S}$. Since there are an infinite number of states in many practical applications, even one state with a large error might cause the inequality in the theorem statement to fail.
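As a toy numeric check of how restrictive the bound can be (all numbers made up purely for illustration):

```python
def rhs(eps_phi, eps_inf, gamma):
    """The theorem's right-hand side: eps_Phi + 2*||eps||_inf / (1 - gamma)."""
    return eps_phi + 2.0 * eps_inf / (1.0 - gamma)

# Even a small worst-case per-state reward error gets amplified by the
# effective horizon 1/(1 - gamma):
print(rhs(eps_phi=0.5, eps_inf=0.1, gamma=0.9))   # about 2.5
print(rhs(eps_phi=0.5, eps_inf=0.1, gamma=0.99))  # about 20.5
```

So with $\gamma = 0.99$, extrapolation is only guaranteed when the optimal policy beats the demonstrations by more than roughly 20.5 in return.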

The proof also relies on the assumption that the estimated reward function is a linear combination of features (that’s what $\hat{R}(s)=w^T\phi(s)$ means), but $\phi$ could contain arbitrarily complex features, so I suppose it’s a weak assumption (which is good), though I am not sure.

Concluding Remarks

Overall, the T-REX and D-REX papers are nice IRL papers that rely on preferences between trajectories. The takeaways I get from these works:

  • While reinforcement learning may be very exciting, don’t forget about the perhaps lesser-known task of inverse reinforcement learning.

  • Taking subsamples of trajectories is a helpful way to do data augmentation when doing anything at the granularity of episodes.

  • Perhaps most importantly, I should understand when and how preference rankings might be applicable and beneficial. In these works, preferences enable training an agent to perform better than the demonstration data without strictly requiring ground truth environment rewards, and potentially without even requiring demonstrator actions (though D-REX requires actions).

I hope you found this post helpful. As always, thank you for reading, and stay safe.


Papers covered in this blog post:










Research Talk at the University of Toronto on Robotic Manipulation

Mar 21, 2021


A video of my talk at the University of Toronto with the Q-and-A at the end.

Last week, I was very fortunate to give a talk “at” the University of Toronto in their AI in Robotics Reading Group. It gives a representative overview of my recent research in robotic manipulation. It’s a technical research talk, but still somewhat high-level, so hopefully it should be accessible to a broad range of robotics researchers. I normally feel embarrassed when watching recordings of my talks, since I realize I should have done X instead of Y in so many places. Fortunately I think this one turned out reasonably well. Furthermore, and to my delight, the YouTube / Google automatic captions captured my audio with a high degree of accuracy.

My talk covers these three papers in order:

We covered the first two papers in a BAIR Blog post last year. I briefly mentioned the last one in a personal blog post a few months ago, with the accompanying backstory behind how we developed it. A joint Google AI and BAIR Blog post is in progress … I promise!

Regarding that third paper (for ICRA 2021), when making this talk in Keynote, I was finally able to create the kind of animation that shows the intuition for how a Goal-Conditioned Transporter Network works. Using Google Slides is great for drafting talks quickly, but I think Keynote is better for formal presentations.

I thank the organizers (Homanga Bharadhwaj, Arthur Allshire, Nishkrit Desai, and Professor Animesh Garg) for the opportunity, and I also thank them for helping to arrange the two sign language interpreters for my talk. Finally, if you found this talk interesting, I encourage you to view the talks from the other presenters in the series.










Getting Started with SoftGym for Deformable Object Manipulation

Feb 20, 2021


Visualization of the PourWater environment from SoftGym. The animation is from the project website.

Over the last few years, I have enjoyed working on deformable object manipulation for robotics. In particular, it was the focus of my Google internship work, and I previously did some work with deformables before that, highlighted with our BAIR Blog post here. In this post, I’d like to discuss the SoftGym simulator, developed by researchers from Carnegie Mellon University in their CoRL 2020 paper. I’ve been exploring this simulator to see if it might be useful for my future projects, and I am impressed by the simulation quality and how it also has support for fluid simulation. The project website has more information and includes impressive videos. This blog post will be similar in spirit to one I wrote almost a year ago about using a different code base (rlpyt) with a focus on the installation steps for SoftGym.

Installing SoftGym

The first step is to install SoftGym. The provided README has some information but it wasn’t initially clear to me, as shown in my GitHub issue report. As I stated in my post on rlpyt, I like making long and detailed GitHub issue reports that are exactly reproducible.

The main thing to understand when installing is that if you’re using an Ubuntu 16.04 machine, you (probably) don’t have to use Docker. (However, Docker is incredibly useful in its own right, so I encourage you to learn how to use it if you haven’t done so already.) If you’re using Ubuntu 18.04, then you definitely have to use Docker. However, Docker is only used to compile PyFleX, which has the physics simulation for deformables. The rest of the repository can be managed through the usual conda environment.

Here’s a walk-through of my installation steps on an Ubuntu 18.04 machine, and I assume that conda is already installed. So far, the code has worked for me on a variety of CUDA and NVIDIA driver versions. You can find the CUDA version by running:

seita@mason:~ $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

For example, the above means I have CUDA 10.0. Similarly, the driver version can be found from running nvidia-smi.

Now let’s get started by cloning the repository and then creating the conda environment:

git clone https://github.com/Xingyu-Lin/softgym.git
cd softgym/
conda env create -f environment.yml

This command will create a conda environment that has the necessary packages with their correct version. However, there’s one more package to install, the pybind11 package, so I would install that after activating the environment:

conda activate softgym
conda install pybind11

At this point, the conda environment should be good to go.

Next we have the most interesting part, where we use Docker. Here’s the installation guide for Ubuntu machines in case it’s not installed on your machine yet. I’m using Docker version 19.03.6. A quick refresher on terminology: Docker has images and containers. An image is like a recipe, whereas a container is an instance of it. StackOverflow has a more detailed explanation. Therefore, after running this command:

docker pull xingyu/softgym

we are downloading the author’s pre-provided Docker image, and it should be listed if you type in docker images on the command line:

seita@mason:~$ docker images
REPOSITORY                           TAG                             IMAGE ID            CREATED             SIZE
xingyu/softgym                       latest                          2cbcd6a50965        3 months ago        2.44GB

If you’re running into issues with requiring “sudo”, you can mitigate this by adding yourself to a “Docker group” so that you don’t have to type it in each time.

Next, we have to run a command to start a container. Here, we’re using nvidia-docker since this requires CUDA, as one would expect given that FleX is from NVIDIA. This is not installed when you install Docker, so please refer to this page for installation instructions. Once that’s done, to be safe, I would check to make sure that nvidia-docker -v works on your command line and that the version matches what’s printed from docker -v.

For SoftGym, here is the command I use:

(softgym) seita@mason:~/softgym$ nvidia-docker run \
    -v /home/seita/softgym:/workspace/softgym \
    -v /home/seita/miniconda3:/home/seita/miniconda3 \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --gpus all \
    -e DISPLAY=$DISPLAY \
    -e QT_X11_NO_MITSHM=1 \
    -it xingyu/softgym:latest bash

Here’s an explanation:

  • The first -v will mount /home/seita/softgym (i.e., where I cloned softgym) to /workspace/softgym inside the Docker container’s file system. Thus, when I enter the container, I can change directory to /workspace/softgym and it will look as if I am in /home/seita/softgym on the original machine. The /workspace seems to be the default directory we start in Docker containers.
  • A similar thing happens with the second mounting command for miniconda. In fact I’m using the same exact directory before and after the colon, which means the directory structure is the same inside the container.
  • The -it and bash portions will create an environment in the container which lets us type in things on the command line, like with normal Ubuntu machines. Here, we will be the root user. The Docker documentation has more information about these arguments. Note that -it is shorthand for -i -t.
  • The other commands are copied from the SoftGym Docker README.

Running the command means I enter a Docker container as a “root” user, and you should be able to see this container listed if you type in docker ps in another tab (outside of Docker) since that shows the active container IDs. At this point, we should go to the softgym directory and run the scripts to (1) prepare paths and (2) compile PyFleX:

root@82ab689d1497:/workspace# cd softgym/
root@82ab689d1497:/workspace/softgym# export PATH="/home/seita/miniconda3/bin:$PATH"
root@82ab689d1497:/workspace/softgym# . ./prepare_1.0.sh
(softgym) root@82ab689d1497:/workspace/softgym# . ./compile_1.0.sh

The above should compile without errors. That’s it! We can then exit Docker (just type in “exit”).

If you’re using Ubuntu 16.04, the steps should be similar but also much simpler, and here is the command history that I have when using it:

git clone https://github.com/Xingyu-Lin/softgym.git
cd softgym/
conda env create -f environment.yml
conda activate softgym
. ./prepare_1.0.sh
. ./compile_1.0.sh
cd ../../..

The last change directory command is because the compile script changes my path. Just go back to the softgym/ directory and you’ll be ready to run.

Code Usage

Back in our normal Ubuntu 18.04 command line setting, we should make sure our conda environment is activated, and that paths are set up appropriately:

(softgym) seita@mason:~/softgym$ export PYFLEXROOT=${PWD}/PyFlex
(softgym) seita@mason:~/softgym$ export PYTHONPATH=${PYFLEXROOT}/bindings/build:$PYTHONPATH
(softgym) seita@mason:~/softgym$ export LD_LIBRARY_PATH=${PYFLEXROOT}/external/SDL2-2.0.4/lib/x64:$LD_LIBRARY_PATH

To make things easier, you can use a script like prepare_1.0.sh to adjust paths for you, so that you don’t have to keep typing in these “export” commands manually.

Finally, we have to turn on headless mode for SoftGym if running over a remote machine. This was a step that tripped me up for a while, even though I’m usually good about remembering this after having gone through similar issues using the Blender simulator (for rendering fabric images remotely). Commands like this should hopefully work, which run the chosen environment and have the agent take random actions:

(softgym) seita@mason:~/softgym$ python examples/random_env.py --env_name ClothFlatten --headless 1

If you are running on a local machine with a compatible GPU, you can remove the headless option to have the animation play in a new window. Be warned, though: the size of the window should remain fixed throughout, since the code appends frames together, so don’t drag and resize the window. You can right click on the mouse to change the camera angle, and use W-A-S-D keyboard keys to navigate.

Long story short, SoftGym contains one of the nicest looking physics simulators I’ve seen for deformable objects. I also really like the support for liquids. I can imagine future robots transporting boxes and bags of liquids.

Working and Non-Working Configurations

I’ve tried installing SoftGym on a number of machines. To summarize, here are all the working configurations, which are tested by running the examples/random_env.py script:

  • Ubuntu 16.04, CUDA 9.0, NVIDIA 440.33.01, no Docker at all.
  • Ubuntu 18.04, CUDA 10.0, NVIDIA 450.102.04, only use Docker for installing PyFleX.
  • Ubuntu 18.04, CUDA 10.1, NVIDIA 430.50, only use Docker for installing PyFleX.
  • Ubuntu 18.04, CUDA 10.1, NVIDIA 450.102.04, only use Docker for installing PyFleX.

Unfortunately, I have run into a case where SoftGym does not seem to work:

  • Ubuntu 16.04, CUDA 10.0, NVIDIA 440.33.01, no Docker at all. The only difference from a working setting above is that it’s CUDA 10.0 instead of 9.0. This setting is resulting in:
Waiting to generate environment variations. May take 1 minute for each variation...
*** stack smashing detected ***: python terminated
Aborted (core dumped)

I have yet to figure out how to fix this. If you’ve found and addressed this issue, it would be nice to inform the code maintainers.

The Code Itself

The code does not seem to include their reinforcement learning benchmarks. Unfortunately, that’s in a separate code base which is not yet public. In SoftGym, there is a basic pick and place action space with fictitious grippers, which may be enough for preliminary usage.

Fortunately, the code is fairly readable. There’s a FlexEnv class and a sensible class hierarchy for the different types of deformables supported – rope, cloth, and liquids. Here’s how the classes are structured, with parenting relationships based on the indentation below:

FlexEnv
    RopeNewEnv
        RopeFlattenEnv
            RopeConfigurationEnv
    ClothEnv
        ClothDropEnv
        ClothFlattenEnv
        ClothFoldEnv
            ClothFoldCrumpledEnv
            ClothFoldDropEnv
    FluidEnv
        PassWater1DEnv
        PourWaterPosControlEnv
        PourWaterAmountPosControlEnv

I am in the process of figuring out how to seamlessly add items together to customize my reinforcement learning environments. The code maintainers responded to some questions I had in this GitHub issue report about making new environments. The summary is that (1) this appears to require knowledge of how to use a separate library, PyFleX, and (2) when we make new environments, we have to make new header files with the correct combination of objects we want, and then re-compile PyFleX.

Conclusion

I hope this blog post can be of assistance when getting started with SoftGym. I am excited to see what researchers try with it going forward, and I’m grateful to be in a field where simulation for robotics is an active area of research.










Five New Research Preprints Available

Jan 3, 2021


The video for the paper "Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks."

The Fall 2020 semester was an especially busy one, since I was involved in multiple paper submissions with my outstanding collaborators. Five preprints are now available, and this post summarizes each of them, along with some of the backstories behind the papers. In all cases, arXiv should have the most up-to-date version of each paper.

The bulk of this work was actually done in Spring 2020, but we’ve made some significant improvements in the latest version on arXiv by expanding the experiments and improving the writing. The main idea in this paper is to use dense object descriptors (see my blog post here) in simulation to get correspondences between two different images of the same object, which in our case would be fabrics. If we see two images of the same fabric, where the fabric’s appearance may differ between the two (e.g., having a fold versus no fold), we would like to know which pixels in image 1 correspond to which pixels in image 2, in the sense that corresponding pixels cover the same part of the fabric. We can use the learned correspondences to design robot policies that smooth and fold real fabric, and we can even do this in real environments with the aid of domain randomization.

I was originally hoping to include this paper in our May 2020 BAIR Blog post on fabric manipulation, but the blog authors and I decided against this, since this paper doesn’t neatly fit into the “model-free” vs “model-based” categorization.

This paper proposes Intermittent Visual Servoing (IVS), a framework which uses a coarse controller in free space, but employs imitation learning to learn precise actions in regions that have the highest accuracy requirements. Intuitively, many tasks are characterized by some “bottleneck points”, such as tightening a screw, and we’d like to specialize the learning portion for those areas.

To benchmark IVS, we test on a surgical robot, and train it to autonomously perform surgical peg transfer. For some context: peg transfer is a task commonly used as part of a curriculum to train human surgeons for robot surgery. Robots are commonly used in surgery today, but in all cases, a human manipulates tools, which then causes the surgical robot to move in known directions. This process is specifically referred to as “teleoperation.”

For our automated surgical robot on peg transfer, we show high success rates, and transferability of the learned model across multiple surgical arms. The latter is a known challenge as different surgical arm tools have different mechanical properties, so it was not clear to us if off-the-shelf IVS could work, but it did!

This paper is an extension of our ISMR 2020 and IEEE RA-Letters 2020 papers, which also experiment with surgical peg transfer. It therefore relates to the prior paper on Intermittent Visual Servoing, though I would not call it an extension of that paper, since we don’t actually apply IVS here, nor do we test transferability across different surgical robot arms.

In this work, we use depth sensing, recurrent neural networks, and a new trajectory optimizer (thanks to Jeff Ichnowski) to get an automated surgical robot to outperform a human surgical resident on the peg transfer task. In this and our ISMR 2020 paper, Danyal Fer acted as the human surgical resident. For our ISMR 2020 paper, we couldn’t get the surgical robot to be as good as him on peg transfer, prompting this frequent internal comment among us: Danyal, how are you so good??

Well, with the combination of these new techniques, plus terrific engineering work from postdoc Minho Hwang, we finally obtained accuracy and timing results at or better than those Danyal Fer obtained. I am looking forward to seeing how far we can push ahead in surgical robotics in 2021.

This shows a cool application of using a UR5 arm to perform high speed dynamic rope manipulation tasks. Check out the video of the paper (on the project website), which comes complete with some Indiana Jones style robot whipping demonstrations. We also name the proposed learning algorithm in the paper using the INDY acronym, for obvious reasons.

The first question I would have when thinking about robots whipping rope is: how do we define an action? We decided on a simple yet flexible enough approach that worked for whipping, vaulting, and weaving tasks: a parabolic action motion coupled with a prediction of the single apex point of this motion. The main inspiration for this came from the “TossingBot” paper from Andy Zeng, which used a similar idea for parameterizing a tossing action. That brings us to the fifth and final paper featured in this blog post …
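
As a rough illustration of the apex-based idea (this is my own sketch, not the paper’s actual action parameterization), a start point plus a single predicted apex point are enough to pin down an entire parabolic motion in a vertical plane:

```python
import numpy as np

def parabolic_waypoints(start, apex, n=20):
    """Waypoints along a vertical-plane parabola defined by its apex.

    start, apex: (x, y) pairs; the apex x must differ from the start x.
    Returns an (n, 2) array of points from `start` to the point mirrored
    across the apex, following y = y_apex - k * (x - x_apex)^2.
    """
    x0, y0 = start
    xa, ya = apex
    k = (ya - y0) / (x0 - xa) ** 2          # curvature fixed by the two constraints
    xs = np.linspace(x0, 2.0 * xa - x0, n)  # symmetric span about the apex
    ys = ya - k * (xs - xa) ** 2
    return np.stack([xs, ys], axis=1)

waypoints = parabolic_waypoints(start=(0.0, 0.0), apex=(0.5, 1.0))
```

The appeal of such a parameterization is its compactness: the policy only needs to predict one point, rather than a full trajectory.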

Here, we finally have the one paper where I’m the first author, and the one for which I expended the bulk of my research efforts. (You can imagine what my work days were like last fall, with me working on this paper in the mornings and afternoons, followed by the other papers above in the evenings.) This paper came out of my Summer 2020 virtual internship with Google Robotics, where I was hosted by the great Andy Zeng. Before the internship began, Andy and I knew we wanted to work on deformable object manipulation, and we thought it would be nice to show a robot manipulating bags, since that would be novel. But we weren’t sure what method to use to train the robot.

Fortunately, at that time, Andy was hard at work on something called Transporter Networks. It ended up as one of the top papers presented at CoRL 2020. Andy and I hypothesized that Transporter Networks could work well on a wide range of deformable object manipulation tasks. So, I designed over a dozen simulated environments using PyBullet that included the full suite of 1D, 2D, and 3D deformables. We were actually thinking of using Blender before the internship, but at some point I realized that Blender would not be suitable. Pivoting to PyBullet, though painful initially, proved to be one of the best decisions we made.

While working on the project, Andy and I wanted to increase the flexibility of Transporter Networks to different task specifications. That’s where the “goal-conditioned” version came from. There are multiple ways of specifying goals; here, we decided to specify an image of the desired rearrangement configuration.

Once we had the architectures and the simulated tasks set up, it was a matter of finding the necessary compute to run the experiments, and iterating upon the design and tasks.

I am very pleased with how the paper turned out, and I hope to release a more detailed blog post about this paper, both here and on the BAIR and Google AI blogs. I also really enjoyed working with this team; I have not met any of the Google-affiliated authors in person, so I look forward to the day when the pandemic subsides.

I hope you find these papers interesting! If you have questions or would like to discuss topics in these papers further, feel free to reach out.










All the Books I Read in 2020, Plus My Thoughts

Jan 1, 2021

Every year I have a tradition where I try to write down all the books I read, and to summarize my thoughts. Although 2020 was quite different from years past, I was still able to get away from the distractions of the world by diving into books. I have listed 40 books here:

  • Popular Science (6 books)
  • Current Events (4 books)
  • Business and Technology (5 books)
  • China (4 books)
  • Race and Anti-Racism (5 books)
  • Countries (4 books)
  • Psychology and Psychiatry (4 books)
  • Miscellaneous (8 books)

The total is similar to past years (2016 through 2019): 34, 43, 35, 37. As always, you can find prior summaries in the archives. I tried to cut down on the length of the summaries this year, but was only partially successful.

Group 1: Popular Science

Every year, I try to find a batch of books that quenches my scientific curiosity.

  • ** Physics of the Future: How Science Will Shape Human Destiny and Our Daily Lives by the Year 2100 ** (2011) blew me away with its whirlwind tour of the future. Michio Kaku, a famous theoretical physicist and CUNY professor, attempts to predict 2100. Kaku’s vision relies on (a) what is attainable subject to the laws of physics, and (b) interviews with hundreds of leading scientific experts, including many whose names I recognize. Crudely, one can think of Physics of the Future as a more general vision of Ray Kurzweil’s How to Create a Mind (discussed below) in that Kurzweil specializes in AI and neuroscience, whereas Kaku covers a wider variety of subjects. Physics of the Future has separate chapters on: Computers, AI, Medicine, Nanotech, Energy, Space Travel, Wealth, Humanity, and then a final chapter on a “Day in the Life of 2100.” Kaku breaks down each subject into what he thinks will happen in (a) the near future to 2030, (b) 2030-2070, and (c) 2070-2100. For example, in the chapter on computers, much discussion is devoted to the limits of current silicon-based CPUs, since we are hitting the theoretical limit of how many transistors we can fit on a silicon chip, which is why there has been so much effort to go beyond Moore’s Law, such as parallel programming and quantum computing. In the AI chapter, which includes robotics, there is a brief mention of learning-based versus “classical” approaches to creating AI. If Kaku had written this book just a few years later, this chapter would look very different. In biology and medicine, Kaku is correct in that we will try to build upon advances in gene therapy and extend the human lifespan, which might (and this is a big “might”) be possible with the more recent CRISPR technologies (not mentioned in the book, of course).
While my area of expertise isn’t in biology and medicine, or the later chapters on nanotechnology and energy, by the time I finished this book, I was in awe of Kaku’s vision of the future, though also somewhat tempered by the enormous challenges ahead of us. For a more recent take on Kaku’s perspective, here is a one-hour conversation on Lex Fridman’s podcast where he mentions that CRISPR-like technologies will let humans live forever by identifying “mistakes” in cells (i.e., the reason why we die). I’m not quite as optimistic as Kaku on that prospect, but I share his excitement for science.

  • ** How to Create a Mind: The Secret of Human Thought Revealed ** (2012) by the world’s most famous futurist, Ray Kurzweil. While his most popular book is The Singularity is Near from 2005, this shorter book — a follow-up in some ways — is a pleasure to read. In How to Create a Mind, Kurzweil focuses on reverse-engineering the brain by conjecturing how it works, and how the process could be emulated in a computer. The aspiration is obvious: if we can do this, then perhaps we can create intelligent life. If, in practice, machines “trick” people into thinking they are real brains with real thought, then Kurzweil argues that for all practical purposes they are conscious (see Chapter 9).1 There was some discussion about split-brain patients and the like, which overlaps with some material in Incognito, which I read in 2017. Throughout the book, there is emphasis on the neocortex, which, according to Wikipedia, plays a fundamental role in learning and memory. Kurzweil claims it acts as a pattern recognizer, with a hierarchy that lets us conduct higher-order reasoning. This makes sense, and Kurzweil spends a lot of effort describing ways we can simulate the neocortex. That’s not to say the book is 100% correct or prescient. He frequently mentions Hidden Markov Models (HMMs), but I hardly ever read about them nowadays. Perhaps the last time I actually implemented HMMs was for a speech recognition homework assignment in the Berkeley graduate Natural Language Processing course back in 2014. The famous AlexNet paper appeared just a few months after this book was published, catalyzing the Deep Learning boom. Also, Kurzweil’s prediction that self-driving cars would be here “by the end of the decade” was wildly off. I think it’s unlikely we will see them publicly available even by the end of this new decade, in December of 2029.
But he also argues that as of 2012, the trends from The Singularity is Near are continuing, with updated plots showing that once a technology becomes an information technology, the “law of accelerating returns” kicks in, creating exponential growth. There are counterarguments, such as those raised by the late Paul Allen, and Kurzweil spends the last chapter refuting Allen’s arguments. I want to see an updated 2021 edition of Kurzweil’s opinions on the topics in this book, just as I do for Kaku’s book.

  • ** A Crack in Creation: Gene Editing and the Unthinkable Power to Control Evolution ** (2017) by Berkeley Professor Jennifer A Doudna and her former PhD student Samuel H Sternberg (now at Columbia University). The Doudna lab has a website with EIGHTEEN postdocs at the time of my reading this! I’m sure that can’t be the norm, since Doudna is one of the stars of the Berkeley chemistry department and recently won the 2020 Nobel Prize in Chemistry. This book is about the revolutionary technology called CRISPR. The first half provides technical background, and the second half describes the consequences, both the good (what diseases it may cure) and the bad (ethics and dangers). In prior decades, I remember hearing about “gene therapy,” but CRISPR is “gene editing” — it is far easier to use CRISPR to edit genes than any prior technology, which is one of the reasons why it has garnered widespread attention since a famous 2012 Science paper by Doudna and her colleagues. The book provides intuition for how CRISPR works to edit genes, though as with anything, it will be easier to understand for people who work in this field. The second half of the book is more accessible and brings up the causes of concern: designer babies, eugenics, and so on. My stance is probably similar to that of Doudna and most scientists, in that I support investigating the technology with appropriate restrictions. A Crack in Creation was published in 2017, and already in November 2018, a story broke (see MIT Review and NYTimes articles) about the scientist He Jiankui, who claimed to create the first gene-edited humans. The field is moving so fast, and reading this book made obvious the similarities between CRISPR and AI technologies: both are (a) growing powerful quickly and (b) in need of safety and ethical considerations.
Sadly, I also see how CRISPR can lead to battle lines over who has credit for the technology; in AI, we have a huge problem with “flag planting” and “credit assignment” and I hope this does not damage the biochemistry field. I am also curious about the relationship between CRISPR and polygenic scores,2 which were discussed in the book Blueprint (see my thoughts here). I wish there were more books like A Crack in Creation.

  • ** Scale: The Universal Laws of Growth, Innovation, Sustainability, and the Pace of Life in Organisms, Cities, Economies, and Companies ** (2017) is one of my favorites this year. By Geoffrey West, a Santa Fe Institute theoretical physicist who’s more accurately described as a “jack of all trades,” the book unifies the theme of “scale” across organisms, cities, and companies. It asks questions like: why aren’t there mammals the size of Godzilla? Why aren’t humans living for 200 years? How do income and crime scale with city size? Any reputable scientist can answer the question about Godzilla: anything of Godzilla’s size would not be able to support itself, unless it were somehow made of “different” living material. West’s key insight is to relate this to an overarching theme of exponential growth and scaling. For example, consider networks and capillaries. Mammals have hearts that pump blood into areas of the body, with the vessel size decreasing up to the capillaries at the end. But across all mammals, the capillaries at the “ends” of this system are roughly the same size, and optimize the “reachability” of the system. Furthermore, this is similar to a water system in a city, so perhaps the organization and size limitations of cities are similar to those of mammals. Another key finding is that many attributes of life are constant across many organisms. Take the number of heartbeats in a mammal’s lifespan. Smaller mammals have much faster heart rates, whereas bigger mammals have much slower heart rates, yet the number of heartbeats is roughly the same across an enormous variety of organisms. That factor, along with mortality rates for humans, suggests a natural limit to human lifespans, so West is skeptical that humans will live far beyond the current record of 122 years. Scale is filled with charts showing various qualities that are consistent across organisms, cities, and companies, and which also demonstrate exponential growth.
It reminds me of Steven Pinker’s style of adding quantitative metrics to social science research. West concludes with a disconcerting discussion about whether humanity can continue accelerating at the superexponential rate we’ve been living. While careful not to fall into the “Malthusian trap,” he’s concerned that the environment will no longer be able to support our rate of living. Scale is a great book from one of academia’s brightest minds that manages to turn the scientific details into something readable. If you don’t have the time to read 450+ pages, then his 2011 TED Talk might be a useful alternative.

  • ** The Book of Why: The New Science of Cause and Effect ** (2018) is by 2011 Turing Award winner Judea Pearl, a professor at UCLA and a leading researcher in AI, along with science writer Dana MacKenzie3. I first remember reading about Pearl’s pioneering work in Bayesian Networks when I was an undergrad trying (unsuccessfully) to do machine learning research. To my delight, Bayesian Networks are featured in The Book of Why, and I have fond memories of studying them for the Berkeley AI Prelims. Ah. Pearl uses the metaphor of a ladder with three rungs describing levels of understanding. The first rung is where the current Deep Learning “revolution” lies, and relates to pattern matching. In the second rung, a machine must be able to determine what happens when an intervention is applied. Finally, the third and most interesting rung is on counterfactual inference: what would have happened if, instead of \(X\), we had actually done \(Y\)? It requires us to imagine a world that did not exist, and Pearl argues that this thinking is essential to create advanced forms of AI. Pearl is an outspoken skeptic of the “Big Data” trend, where one just looks at the data to find a conclusion. So this book is his way of expressing his journey through causal inference to a wider audience, where he introduces the “\(P(X | do(Y))\)” operator (in contrast to \(P(X | Y)\)), how to disentangle the effect of confounding, and how to perform counterfactual inference. What is the takeaway? If I’m judging the “Turing Award” designation correctly, it seems that Pearl’s work on causality is widely accepted, or at least not vigorously opposed, by those in the community, so I guess it’s been a success? I should also have anticipated that Andrew Gelman would review the book on his famous blog with some mixed reactions.
To summarize (and I might share this view): while The Book of Why brings up many interesting points, it may read too much like someone reveling in his “conquering” of “establishment statisticians,” which might turn off readers. Some of the text also over-claims: the book says causality can help with smoking, taxes, climate change, and so forth, but those problems can arguably be addressed without resorting to the exact causal inference machinery.

  • ** Human Compatible: Artificial Intelligence and the Problem of Control ** (2019) is by Stuart Russell, a Berkeley computer science professor and leading authority on AI. Before the pandemic, I frequently saw Prof. Russell, as our offices are finally on the same floor, and I enjoyed reading and blogging about his textbook (soon to be updated!) back when I was studying for the AI prelims. A key message from Human Compatible is that we need to be careful when designing AI. Russell argues: “machines are beneficial to the extent that their actions can be expected to achieve our objectives”. In other words, we want robots to achieve our intended objectives, which is not necessarily — and usually is not! — what we exactly specified in the objective through a cost or reward function. Instead, the AI field has essentially been trying to make intelligent machines achieve “the machine’s” objective. This is problematic in several ways, one of which is that humans are bad at specifying their intents. A popular example of this is in OpenAI’s post about faulty reward functions. The BAIR blog has similar content in this post and a related post (by Stuart Russell’s students, obviously). As AI becomes more powerful, mis-specified objective functions have greater potential for negative consequences, hence the need to address this and other misuses of AI (e.g., see Chapter 4 and lethal autonomous weapons). There are a range of possible techniques for obtaining provably beneficial AI, such as making machines “turn themselves off” and ensuring they don’t block that, or having machines ask humans for assistance in uncertain cases, or having machines learn human preferences. Above all, Russell makes a convincing case for human-compatible AI discourse, and I recommend the book to my AI colleagues and to the broader public.

Group 2: Current Events

These are recent books covering current events.

  • ** Factfulness: Ten Reasons We’re Wrong About the World — and Why Things Are Better Than You Think ** (2018) by the late Hans Rosling, who died of cancer and was just able to finish this book in time with his family. Hans Rosling was a Swedish physician and academic, and from the public’s view, may be best known for his data visualization techniques4 to explain why many of us in so-called “developed countries” have misconceptions about “developing countries” and the world more broadly. (Look him up online and watch his talks, for example this TED talk.) The ten reasons in Factfulness are described as “instincts”: gap, negativity, straight line, fear, size, generalization, destiny, single perspective, blame, and urgency. In discussing these points, Rosling urges us to dispense with the terms “developing” and “developed” and instead to use a four-level scale, with most of the world today on “Level 2” (and the United States on “Level 4”). Rosling predicts that in 2040, most of the world will be on Level 3. Overall, this book is similar to Steven Pinker’s Better Angels and Enlightenment Now, so if you like those two, as I did, you will probably like Factfulness. However, there might not be as much novelty. I want to conclude with two thoughts. First, the criticism of “cherry-picking facts” is correct but also somewhat unfair, since any book that covers a topic as broad as the state of the world will be forced to do so. Second, while reading this book, I think there is a risk of focusing too much on countries that have a much lower baseline of prosperity to begin with (e.g., countries on Levels 1 and 2), and it would be nice to see similarly positive news for countries which are often viewed as “wealthy but stagnant” today, such as Japan and (in many ways) the United States.
Put another way, can we develop a book like Factfulness that will resonate with factory workers in the United States who have lost jobs due to globalization, or people lamenting soaring income inequality?

  • ** The Coddling of the American Mind: How Good Intentions and Bad Ideas are Setting Up a Generation for Failure ** (2018) was terrific. It’s written by Greg Lukianoff, a First Amendment lawyer specializing in free speech on campuses, and Jonathan Haidt, a psychology professor at NYU and one of the most well-known in his field. For perspective, I was aware of Haidt before reading this book. The Coddling of the American Mind is an extended version of their article in The Atlantic, which introduced their main hypothesis that the trend of protecting students from ideas they don’t like is counterproductive. Lukianoff and Haidt expected a wave of criticism after their article, but there seemed to be agreement from across the political spectrum. They emphasize how much of the debate over free speech on college campuses is a debate within the political left, given the declining proportion of conservative students and faculty. The simple explanation is that the younger generation disagrees with older liberals, the latter of whom generally favor freer speech. The book mentions both my undergrad, Williams College, and my graduate school, the University of California, Berkeley, since both institutions have faced issues with free speech and inviting conservative speakers to campus. More severe were the incidents at Evergreen State, though fortunately what happened there was far from typical. Lukianoff and Haidt also frequently reference Jean Twenge’s book IGen: Why Today’s Super-Connected Kids Are Growing Up Less Rebellious, More Tolerant, Less Happy – and Completely Unprepared for Adulthood – and What That Means for the Rest of Us, with its long, self-explanatory subtitle. I raced through The Coddling of the American Mind and will definitely keep it in mind for my own future. Like Haidt, I generally identify with the political left, but I read a fair amount of conservative writing and feel like I have significantly benefited from doing so.
I also generally oppose disinviting speakers, or “cancel culture” more broadly. This book was definitely a favorite of mine this year. The title is unfortunate, as the “coddling” terminology might cause the people who would benefit the most to avoid reading it.

  • ** The Tyranny of Merit: What’s Become of the Common Good? ** (2020) by Michael J. Sandel, a Professor of Government at Harvard University who teaches political philosophy. Sandel’s objective is to inform us about the dark side of meritocracy. Whereas in the past, being a high-status person in American society was mainly due to being white, male, and wealthy, nowadays America’s educational system has changed to a largely merit-based one, however one defines “merit.” But for all these changes, we still have low income mobility, where the children of the wealthy and highly educated are likely to remain in high status professions, and the poor are likely to remain poor. Part of this is because elite colleges and universities are still over-represented by the wealthy. But, argues Sandel, even if we achieve true meritocracy, would that actually be desirable? He warns us that this will exacerbate credentialism as “the last acceptable prejudice,” where the message we bluntly send to the poor is that they are poor because they lack merit. That’s a tough pill to swallow, which can breed resentment, and Sandel argues this is one of the reasons why Trump won the election in 2016. There are also questions about what properly defines merit, and unfortunate side effects of the race for credentialism, where “helicopter parenting” means young teenagers are fighting to gain admission to a small pool of elite universities. This book is more about identifying the problem than proposing solutions, but Sandel includes some modest approaches, such as (a) adding a lottery to admissions processes at elite universities, and (b) taxing financial transactions that add little value (though these seem quite incremental to me).
Of course, he agrees, it’s better to not have wealth or race be the deciding factor that determines quality of life, as Sandel opens up in his conclusion when describing how future home run record holder Hank Aaron had to practice batting using sticks and bottle caps due to racism. But that does not mean the current meritocracy status quo should be unchallenged.

  • ** COVID-19: The Pandemic That Never Should Have Happened, and How to Stop the Next One ** (2020) by New Scientist reporter Debora MacKenzie, was quickly written in early 2020 and published in June, while the world was still in the midst of the pandemic. The book covers the early stages of the pandemic and how governments and similar organizations were unprepared for one of this magnitude despite early warnings. MacKenzie provides evidence that scientists had been warning for years about the risks of pandemics, but that funding, politics, and other factors hindered the development of effective pandemic strategies. The book also provides a history of some earlier epidemics, such as the flu of 1918 and SARS in 2003, and explains why bats are a common source of infectious diseases. (But don’t go around killing bats; that’s a completely misguided way of fighting COVID-19.) MacKenzie urges us to provide better government support for research and development into vaccines, since while markets are a great thing, it is difficult for drug and pharmaceutical companies to make profits off of vaccines while investing in the necessary “R and D.” She also wisely says that we need to strengthen the World Health Organization (WHO), so that the WHO has the capability to quickly and decisively state when a pandemic is occurring without fear of offending governments. I think MacKenzie hits the right notes here. I support globalization when done correctly. We can’t tear down the world’s gigantic interconnected system, but we can at least build systems with more robustness to future pandemics and catastrophic events. As always, though, it’s easier said than done, and I am well aware that many people do not think as I do. After all, my country has plenty of anti-vaxxers, and every country has its share of politicians who are hyper-nationalistic and willing to silence their own scientists who have bad news to share.

Group 3: Business and Technology

  • Remote: Office Not Required (2013) is a concise primer on the benefits of remote work. It’s by Jason Fried and David Hansson, cofounders of 37Signals (now Basecamp), a software company which specializes in one product (i.e., Basecamp!) to organize projects and communication. I used it once, back when I interned at a startup. Basecamp has unique work policies compared to other companies, which the authors elaborate upon in their 2017 manifesto It Doesn’t Have to be Crazy At Work (discussed below). This book focuses on the remote aspect of their workforce, reflecting how Basecamp’s small group of employees works all around the world. Fried and Hansson describe the benefits of remote work: a traditional office is filled with distractions, the commute to work is generally unpleasant, talent isn’t bound to specific cities, and so on. Then, they show how Basecamp manages its remote workforce, essentially offering a guide to other companies looking to make the transition to remote work. I think many are making the transition if they haven’t done so already. If anything, I was surprised that it was necessary to write a book on these “obvious” facts, but then again, this was published right when Marissa Mayer, then Yahoo!’s CEO, famously said Yahoo! would not permit remote work. In contrast, I was reading this book in April 2020, when we were in the midst of the COVID-19 pandemic, which essentially mandated remote work. While I miss in-person work, I’m not going to argue against the benefits of some remote work.

  • **Chaos Monkeys: Obscene Fortune and Random Failure in Silicon Valley** (2016) is by Antonio García Martínez. A “gleeful contrarian” who entered the world of Silicon Valley after a failed attempt at becoming a scientist (he was formerly a physics PhD student at UC Berkeley) and then a stint as a Goldman Sachs trader, he describes his life at Adchemy5, then as the CEO of his startup, AdGrok, and then his time at Facebook. AdGrok was a three-man startup with Martínez and two other guys, specializing in ads, and despite all their missteps, it got backed by Y Combinator. Was it bought by Facebook? Nope — by Twitter, and Martínez nearly scuttled the whole acquisition by refusing to work for Twitter and instead joining Facebook, essentially betraying his two colleagues. At Facebook, he was a product manager specializing in ads, and soon became embroiled in a dispute over the future ads design; Martínez was proposing a new system called “Facebook Exchange” whereas his colleagues mostly wanted incremental extensions of the existing Facebook Ads system (called “Custom Audiences”). He was eventually fired from Facebook, then went to Twitter as an adviser, and as of 2019 he’s at Branch. Here’s a TL;DR opinionated summary: while I can see why people (usually men) might like this fast-paced exposé of Silicon Valley, I firmly believe there is a way to keep his good qualities — his determination, passion, focus — without the downsides of misogyny, getting women pregnant two weeks after meeting them, and flouting the law. I’ll refer you to this criticism for more details. To add to it: while Martínez is able to describe concepts in Silicon Valley and computing reasonably well, he often peppers those descriptions with sexual innuendos. This is absolutely not the norm among the men I work with. I wonder what his Facebook colleagues thought of him after reading this book.
On a more light-hearted note, soon after reading Chaos Monkeys, I watched Michael I. Jordan’s excellent podcast conversation with Lex Fridman on YouTube6. Prof. Jordan discusses and criticizes Facebook’s business model for failing to create a “consumer-producer ecosystem,” and I wonder how much the idea of Facebook Exchange overlaps with Prof. Jordan’s ideal business model.

  • **It Doesn’t Have to Be Crazy at Work** (2017). The authors are (again) Jason Fried and David Hansson, who wrote Remote: Office Not Required (discussed above). I raced through this book, with repeated smiles and head-nodding. Perhaps more accurately described as a rousing manifesto, it’s engaging, fast-paced, and effectively conveys how Basecamp manages to avoid enforcing a crazy work life. Do we really need 80-hour weeks, endless emails, endless meetings, and so on? Not according to Basecamp: “We put in about 40 hours a week most of the year […] We not only pay for people’s vacation time, we pay for the actual vacation, too. No, not 9 p.m. Wednesday night. It can wait until 9 a.m. Thursday morning. No, not Sunday. Monday.” Ahh … Now, I definitely don’t follow what this book says word-for-word. For example, I work far more than 40 hours a week. My guess is 60 hours, and I don’t count time spent firing off emails in the evening. But I do my best. I try to ensure that my day isn’t consumed by meetings or emails, and that I have long time blocks to myself for focused work. So far I think it’s working for me. I feel reasonably productive and have not burnt out. I try to continue this during the age of remote work. Basecamp has been working remotely for 20 years, and their software (and hopefully work culture) may have gotten more attention recently as COVID-19 spread through the world. Perhaps more employers will enable remote work going forward.

  • **Brotopia: Breaking Up the Boys’ Club of Silicon Valley** (2018) is by Emily Chang, a journalist, author, and current anchor of Bloomberg Technology. For those of us wondering why Silicon Valley continues to be heavily male-dominated despite years and years of public outcry, Chang offers a compelling set of factors. Brotopia briefly covers the early history of the tech industry and how employees were screened for certain traits that statistically favored men. She reviews the “PayPal Mafia” and why meritocracy is a myth, and then covers Google, a company which has for years had good intentions but has experienced its own share of missteps, lawsuits, and press scrutiny over its treatment of women. Then there’s the chapter that Chang reportedly said was “the hardest to research by far,” about secret parties hosted by venture capitalists and other prominent men in the tech industry, where they network and invite young women.7 Chang points out that incentives given by tech companies to employees (e.g., food, alcohol, fitness centers, etc.) often cater to the young and single and encourage a blend of work and life, which means that for relatively older women, work-family imbalance is a top reason why they leave the workforce in alarming numbers. The list of factors which make it difficult for women to enter and comfortably remain in tech goes on and on. After reading this book, I kept feeling depressed about the state of affairs here — can things really be that bad? There are, of course, things I should do given my own proximity to and knowledge of the industry from an academic’s viewpoint in STEM, where we have similar gender representation issues. I can at least make a minimal promise that I will remember the history in this book and ensure that social settings are more comfortable for women.

  • **The Making of a Manager: What to Do When Everyone Looks to You** (2019) is by Julie Zhuo, who worked at Facebook for 14 years, quickly rose through the ranks to become a manager at age 25, and eventually held a Vice President (VP) title. This book, rather than focusing on Zhuo’s personal career trajectory, is best described as a general guide to managing, with some case studies from her time at Facebook (appropriately anonymized, of course). Zhuo advises on the first few months of managing, on managing small versus large teams, on the importance of feedback (both to reports and to managers), on hiring great people, and so on. A consistent theme is that the goal of managing is to increase the output of the entire team. I also liked her perspective on how to delegate tasks, because as managers rise up the hierarchy, meetings become the norm rather than the exception, and so the people who do the “real work” are those lower in the hierarchy, who have to be trusted by managers. I generally view managing in the context of academia, since I am managed by my PhD advisors, and I manage several undergraduates who work with me on research projects. There is substantial overlap between the academic and industry realms, particularly with delegating tasks, and Zhuo’s book — even with its focus on tech — provides advice applicable to a variety of domains. I hope that any future managers I have will be similar in spirit to Zhuo. Now, while reading, I couldn’t help but think about how someone like Zhuo would manage someone like Antonio García Martínez, who wrote Chaos Monkeys (discussed earlier) and overlapped with her time at Facebook, since those two seem to be polar opposites. Whereas Zhuo clearly values empathy, honesty, diversity, support, and so on, Martínez gleefully boasts about cutting corners and having sex, including in one case with a Facebook product manager.
The good news is that Martínez only lasted a few years at Facebook, whereas Zhuo was there for 14 years and left of her own accord to start Inspirit. Hopefully Inspirit will grow into something great!

Group 4: China

As usual, I find that I have an insatiable curiosity for learning more about China. Two of these books focus on issues specific to women. (I have another, more America-focused book on that theme near the end of this post, along with Brotopia, mentioned above.)

  • **Leftover Women: The Resurgence of Gender Inequality in China** (2014) is named for the phrase derisively describing single Chinese women above a certain age (usually 25 to 27) who are pressured to marry and have families. It’s written by Leta Hong Fincher, an American (bilingual in English and Chinese) who got her PhD in sociology at Tsinghua University. Leftover Women grew out of her dissertation work, which involved interviews with several hundred Chinese people, mostly young, well-educated women in urban areas. I had a rough sense of what gender inequality might be like, given its worldwide prevalence, but the book was able to effectively describe the issues specific to China. One major theme is housing in big cities, along with a 2011 interpretation issued by China’s Supreme People’s Court which (in practice) made it far more critical whose name was on a house deed. For married couples who took part in the house-buying spree of the last few decades (part of China’s well-known and massive rural-to-urban migration), the house deed usually bore the man’s name. This exacerbates gender inequality, as Hong Fincher repeatedly emphasizes that property and home values have soared in recent years, making them more important than the salary one earns from a job. Despite these and other issues in China, Hong Fincher reports some promising ways that grassroots organizations are attempting to fight these stereotypes, despite heavy government censorship and disapproval. I was impressed enough by Hong Fincher’s writing to read her follow-up 2018 book. In addition, I noticed her op-ed for CNN arguing that women leaders have been disproportionately better at handling the COVID-19 pandemic.8 Her name has come up repeatedly as I continue my China education.

  • **Betraying Big Brother: The Feminist Awakening in China** (2018) is the second book I read from Leta Hong Fincher. Whereas Leftover Women featured the 2011 Supreme People’s Court interpretation of housing deed law, this book emphasizes the Feminist Five, young Chinese women who were arrested for protesting sexual harassment. You can find a nice abbreviated overview of Betraying Big Brother in a Dissent article. The Feminist Five were harassed in jail and continually spied upon and followed after their release. (Their release may have been due to international pressure.) It was unfortunate to see what these women had to go through, and I reminded myself that I’m lucky to live in a country where women (and men) can stage comparable protests with limited (if any) repercussions. In terms of Chinese laws, the main one relevant to this book is the 2016 domestic violence law, the first of its kind to be passed in China. While Hong Fincher praises the passage of this law, she laments that enforcement is questionable and that gender inequality persists. She particularly critiques Xi Jinping and the “hypermasculinity” that he and the Chinese Communist Party promote. The book ends on an optimistic note about how feminism endures despite heavy government repression. Furthermore, though this book focuses on China, Hong Fincher and the Feminist Five emphasize the need for an international feminist movement that spans all countries (I agree). As a case in point, Hong Fincher highlights how she and other Chinese women attended the American women’s march to protest Trump’s election. While I didn’t learn quite as much from this book as from Leftover Women, I still found it a valuable item in my reading list about feminism.

  • **Superpower Showdown: How the Battle Between Trump and Xi Threatens a New Cold War** (2020) by WSJ reporters Bob Davis and Lingling Wei was fantastic – I had a hard time putting this book down. It’s a 450-page, highly readable account of diplomatic relations between the United States and China in recent years. The primary focus is the behind-the-scenes negotiation that led to the US-China Phase 1 trade deal in January 2020. As reporters, the authors had access to high-ranking officials and were able to get a rough sense of how each “side” viewed the other, not only from the US perspective but also from China’s. The latter is unusual, as the Chinese government is less open about its decision-making, so it was nice to see a bit of how Chinese government officials viewed the negotiations. Davis and Wei likely split the duties, with Davis reporting from the American perspective and Wei from the Chinese perspective. (Wei is a naturalized US citizen, and was among the American journalists China expelled in March 2020.) The authors don’t editorialize too much, beyond describing why they believe certain negotiations failed by listing the mistakes made on both sides — and there were a lot of failed negotiations. Don’t ever say geopolitics is easy. Released in spring 2020, Superpower Showdown was just able to include early information about the COVID-19 pandemic, before it started to spread rapidly in the United States. Unfortunately, COVID-19, rather than uniting the US and China against a common enemy, further deteriorated diplomatic relations. Just after finishing the book, I found a closely related Foreign Affairs essay by Trump’s trade representative Robert E. Lighthizer. Consequently, I now have Foreign Affairs on my reading list.

  • **Blockchain Chicken Farm: And Other Stories of Tech in China’s Countryside** (2020) is by Xiaowei Wang, who, like me, is a PhD student at UC Berkeley (in a different department, Geography). Wang is an American who has family and friends throughout China, and this book is partially a narrative of Wang’s experiences visiting different parts of the country. Key themes are rural areas in China, rather than the big cities which get much of the attention (like America, China has undergone a massive rural-to-urban migration), and the impact of technology on those rural areas. For example, the book describes how chickens and pigs are heavily monitored with technology to maximize their fitness for human consumption, how police officers are increasingly turning to facial recognition software while remaining heavily reliant on humans in the process, and how blockchain is used even though rural people often don’t understand the technology (to be fair, it’s a tricky concept). Wang cautions that increased use of technology and AI will not resolve every issue facing the country, and that these tools come with well-known drawbacks (which I am also aware of, given the concern over AI ethics in my field) that will challenge China’s leaders as they try to continue feeding their citizens and maintaining political stability. It’s a nice, readable book that provides a perspective on both the pervasiveness and the limitations of technology in rural China.

Group 5: Race and Anti-Racism

  • **Evicted: Poverty and Profit in the American City** (2016) is incredible. I can’t add much more praise to what’s already been handed to this Pulitzer Prize-winning book. Evicted is by Matthew Desmond, a professor of sociology at Princeton University. Though published in 2016, the fieldwork dates to 2008 and 2009, when Desmond, then a graduate student at the University of Wisconsin, moved into a trailer park in Milwaukee where poor whites lived. Desmond spent a few months following and interviewing residents and the landlord. He then repeated the process in the North Side of Milwaukee, where poor blacks lived. The result is an account of what it is like to be poor in America and face chronic eviction.9 One huge problem: these tenants often had to pay 60 to 80 percent of their government welfare checks in rent. I also learned how having children increases the chances of eviction, how women are more vulnerable to eviction than men, and about the role of race. The obvious question, of course, is what kind of policy solutions can improve the status quo. Desmond’s main suggestion, posited in the epilogue, is a universal housing voucher, which might reduce the amount spent on homeless shelters. Admittedly, I understand that we need both good policies and better decision-making on the part of these tenants, so it’s important to ensure that there are correct incentives for people to “graduate from welfare.” Interestingly, Desmond didn’t seem to discuss rent control much, despite it being a common topic I hear about nowadays. Another relevant policy area is drugs, since pretty much every tenant here was on drugs. I generally oppose rent control and oppose widespread drug usage, but I also admit that implementing these policies would not fix the immediate problems the tenants face. Whatever your political alignments, if you haven’t done so, I strongly recommend you add Evicted to your reading list.
My only very minor suggestion for this book would be an easy-to-find list of names and short biographies of the tenants at the start of the book.

  • **White Fragility: Why It’s So Hard for White People to Talk About Racism** (2018), by Robin DiAngelo, shot up the New York Times best-seller list earlier this year, in large part due to the racial protests happening in the United States. Her coined phrase “white fragility” has almost become a household term. As DiAngelo says in the introduction, she is white and the book is mainly addressed to a white audience. (I am not really the target audience, but I still wanted to read the book.) DiAngelo discusses her experience leading racial training sessions among employees, and how whites often protest or push back against what she says. This is where the term “white fragility” comes from. Most whites she encounters are unwilling to have extensive dialogues that acknowledge their racial privileges, or try to end the discussion with defensive statements such as: “I am not racist, so I’m OK, someone else is the problem, end of story.” I found the book to be helpful and thought-provoking, and learned about several traps that I will avoid when thinking about race. While reading, I don’t think I personally felt challenged or insulted; I thought the book served exactly as DiAngelo intended: to help me build up knowledge and stamina for discussions of racial issues.

  • **So You Want to Talk About Race** (2018), by Ijeoma Oluo, attempts to provide guidelines for how we can talk about race. Like many books falling under the anti-racist theme, it’s mainly aimed at white people, to help them understand why certain topics or conduct are not appropriate in conversations about race. For example, consider chapters titled “Why can’t I say the ‘N’ word?” and “Why can’t I touch your hair?”. While some of these seem like common sense to me — I mean, do people actually go around touching Black people’s hair, or anyone’s body? — I know that there are enough people who do this that we need to have the conversation. Oluo also effectively dispels the notion that we can just talk about class instead of race, or that we’ll get class out of the way first. I also appreciate her mention of Asians in the chapter on why the model minority myth is harmful. Oluo writes in the introduction that she wished she could have allocated more discussion to Indigenous people. I agree, but no book can contain every topic, so it’s not something I would hold against her work. Oluo has a follow-up book titled Mediocre: The Dangerous Legacy of White Male America, which I should check out soon.

  • **Me and White Supremacy: Combat Racism, Change the World, and Become a Good Ancestor** (2020) is by Layla F. Saad. It started as a 28-day Instagram challenge that went viral. It was published in January 2020, and the timing could not have been better, given that just a few months later, America would face enormous racial protests. I read this book right after White Fragility, whose author (Robin DiAngelo) wrote the foreword; DiAngelo says that Saad gives us a roadmap for addressing the most common question white people have after an antiracist presentation: “What do I do?” In her introduction, Saad says: “The system of white supremacy was not created by anyone who is alive today. But it is maintained and upheld by everyone who holds white privilege.” Saad, an East African and Middle Eastern Black Muslim woman who lives in Qatar and is a British citizen, wants us to tackle this problem so that we leave the world a better place than it is today. Me and White Supremacy is primarily aimed at white people, but also applies to people of color who hold “white privilege,” which would include me. There are four parts: (1) the basics, (2) anti-blackness, racial stereotypes, and cultural appropriation, (3) allyship, and (4) power, relations, and commitments. For example, the allyship chapter covers white apathy, white saviorism (as shown in The Blind Side and elsewhere), tokenism, and being “called out” for racism, which Saad says is inevitable if we take part in anti-racism work. Contrary to what I think Saad was expecting of readers, I didn’t experience too many conflicting emotions or uncomfortable feelings while reading this book. I don’t know if that’s a good thing or a bad thing. It may have been because I read this after White Fragility and So You Want to Talk About Race. I will keep this book in mind, particularly the allyship section, now and in the future.

  • **My Vanishing Country: A Memoir** (2020) is a memoir by Bakari Sellers, who describes his experience living in South Carolina. The value of the book is in providing the perspective of Black rural working-class America, instead of the white working class commonly associated with rural America (as in J. D. Vance’s Hillbilly Elegy). I read the memoir quickly and could not put it down. Here are some highlights from Sellers’ life. When he was 22, fresh out of Morehouse College and in his first year of law school at the University of South Carolina, he was elected to the South Carolina House of Representatives.10 Somehow, he simultaneously served as a representative while attending law school. His representative salary was only 10,000 USD, which might explain why it’s hard for the poor to build a career in state-level politics. He drew attention from Barack Obama, whom Sellers asked to come to South Carolina in return for Sellers’ endorsement in the primaries. Eventually, he ran for Lieutenant Governor (as a Democrat), a huge challenge in a conservative state such as South Carolina, and lost. He’s now a political commentator and a lawyer. The memoir covers the Charleston massacre of 2015, his disappointment when Trump was elected president (he thought that white women would join forces with non-whites to elect Hillary Clinton), and a personal story in which his wife had serious health complications while giving birth, but survived. Sellers credits her survival in part to the fact that the doctors and nurses there were Black and knew him personally, and he concludes with a call to help decrease racial inequities in health care, which persist today in maternal mortality rates and in the lead poisoning of many predominantly Black communities such as Flint, Michigan.

Group 6: Countries

I continue utilizing the “What Everyone Needs to Know” book series, though the batch I picked this year was probably less informative than others in the series. That said, I’m especially happy to have read the fourth book here about Burma (not part of “What Everyone Needs to Know”), which I found from reading Foreign Affairs.

  • **Brazil: What Everyone Needs to Know** (2016) is by Riordan Roett, Professor Emeritus at the Johns Hopkins University School of Advanced International Studies (SAIS), who specializes in Latin American studies. Brazil is a country that I’ve always wanted to know more about, given its size (in population and land area), its geopolitical situation in a region (Latin America) that I know relatively little about, and the Amazon rain forest. The book begins with the early recorded history of Brazil under Portuguese colonization, followed by the struggle for independence. It also recounts Brazil’s difficulties in establishing democracy in the face of military rule. Finally, it concludes with some thought questions about foreign affairs and Brazil’s relations with the US, China, and other countries. This isn’t a page-turner, but I think the bigger issue is that so much of what I want to know about Brazil relates to what has happened over the last five years, particularly the increasingly authoritarian turn of Brazil’s leadership under President Jair Bolsonaro.

  • **Iran: What Everyone Needs to Know** (2016), by the late historian Michael Axworthy, provides a concise overview of Iran’s history. I bought it on iBooks and started reading it literally the day before the murder of Qasem Soleimani, who was widely discussed as a possible successor to Ali Khamenei as the Supreme Leader of Iran; the Supreme Leader is the highest office in Iran. If you are interested in a recap of those events, see this NYTimes account of the events that nearly brought war between the US and Iran. The book was published in 2016, so it does not contain that information, and the last question was predictably about the future of Iran after the 2015 Nuclear Deal,11 with Axworthy noting that Iran seems to be pulled in “incompatible directions,” one toward liberalization and modernity, the other toward conservative Islam and criticism of Israel. The book traces the history of the people who lived in the area that is now Iran, beginning with the Persian Empire, and I liked how Axworthy commented on Cyrus and Darius I, since they are the two Persian leaders in the Civilization IV computer game that I used to play. Later, Axworthy covers the Iran-Iraq war and the Revolution of 1979, which deposed the last Shah (Mohammad Reza Pahlavi) in favor of Ruhollah (Ayatollah) Khomeini. Overall, this book is OK but boring in places and too brief. It may be better to read Axworthy’s longer (but older) book about Iran.

  • **Russia: What Everyone Needs to Know** (2016) is by Timothy J. Colton, a Harvard University Professor of Government and Russian Studies. The focus is Russia, including the Soviet Union from 1922 until its dissolution in 1991 into 15 countries, one of which was Russia itself. As usual for “What Everyone Needs to Know” books, it starts with dry early history. The book gets more interesting when it presents the Soviet Union (i.e., the USSR) and its main leaders: Joseph Stalin, Nikita Khrushchev, Leonid Brezhnev, and Mikhail Gorbachev. Of those leaders, I admire Gorbachev the most, due to glasnost, and Stalin the least, because of the industrial-scale killing on his watch. Then there was Boris Yeltsin and, obviously, Vladimir Putin, who is the subject of much of the last chapter. This book, like the one about North Korea I read last year, ponders who might succeed Vladimir Putin as the de facto leader of Russia. Putin is slated to be in power until at least 2024, and he likely won’t hand power to his family, given that he has no sons. Russia faces other problems, such as alcoholism and demographics, with an aging population and a significantly lower average lifespan for males compared to other countries of Russia’s wealth. Finally, Russia needs to do a better job of attracting and retaining talent in science and engineering. This is one of the key advantages the United States has. (As I said earlier, we cannot relinquish this advantage.) Final note: Colton uses a lot of advanced vocabulary in this book. I frequently had to pause my reading to consult a dictionary.

  • **The Hidden History of Burma: Race, Capitalism, and the Crisis of Democracy in the 21st Century** (2020) is Thant Myint-U’s latest book on Burma (Myanmar)12. Thant Myint-U is now one of my expert sources for Burma-related topics. He has lived there for many years, and has held American and Burmese citizenship at various points in his life. He is often asked to advise the Burmese government and frequently engages with high-level foreign leaders. His grandfather, U Thant, was the third Secretary General of the United Nations, from 1961 to 1971, and I’m embarrassed I did not know that; amusingly, Wikipedia says U Thant was the first Secretary General who retired while on speaking terms with all major powers. The Hidden History of Burma discusses the British colonization, the struggle for independence, and the dynamics of the wildly diverse population (in terms of race and religion). Featured heavily, of course, is Aung San Suu Kyi, the 1991 Nobel Peace Prize13 recipient, and a woman whom I first remember learning about back in high school. She was once viewed as a beacon of democracy and human rights — until, sadly, the last few years. Now the de facto leader of the government, she has overseen one of the most brutal genocides in modern history, against the Rohingya Muslims. Exact numbers are unclear, but it’s estimated that hundreds of thousands have either been killed or have fled to neighboring Bangladesh. How did this happen? The summary is that it wasn’t so much that Burma (and Aung San Suu Kyi) made leaps and bounds of progress before doing a 180 sometime in 2017. Rather, the West, and other foreigners who wanted to help, visit, and invest in the country, badly miscalculated and misinterpreted the situation in Burma while wanting to view Aung San Suu Kyi as an impossibly impeccable hero. There’s a lot more in the book about race, identity, and capitalism, and how these affect Burma’s past, present, and future.
Amusingly, I’ve been reading Thant Myint-U’s Twitter feed, and he often fakes confusion as to whether his tweets are referring to the US or Burma: A major election? Widening income inequality? Illegal immigrants? Big bad China? Environmental degradation? Social media inspired violence? Who are we talking about here? For another perspective on the book, see this CFR review.

Group 7: Psychology and Psychiatry

  • **10% Happier: How I Tamed the Voice in My Head, Reduced Stress Without Losing My Edge, and Found Self-Help That Actually Works–A True Story** (2014) is a book by Dan Harris which (a) chronicles his experience with meditation and how it can reduce stress, and (b) attempts to present meditation as an option to many readers, despite the big “PR problem” that Harris admits plagues meditation. For (a), Harris turned to meditation to reduce the anxiety and stress he was experiencing as a television reporter; he had panic attacks on air and, for a time, turned to drugs. His news reporting got him involved with religious, spiritual, and “happiness” gurus who turned out to be frauds (Ted Haggard and James Arthur Ray), which led Harris to question the self-help industry. A key turning point in Harris’ life was attending a 10-day Buddhist meditation retreat in California led by Joseph Goldstein. He entered the retreat in part due to the efforts of his famous friend Sam Harris (no relation). After the retreat, he started practicing meditation and even developed his own “10% Happier” app with colleagues. Harris admits that meditation isn’t a panacea, which is one reason for the wording “10% happier” in the title. I read many books, so because of sheer quantity, it’s rare that I can follow through on a book’s advice. I will try my best here. Although my field of computer science and robotics research is far different from Harris’ field, I also experience some stress in maintaining my edge, due to the competitive nature of research, so hopefully I can follow this. Harris says all we need are 5 minutes a day. To start: sit comfortably, feel your breath, and each time you get lost in thought, gently return to the breath and start over.

  • **Misbehaving: The Making of Behavioral Economics** (2015), by Nobel Laureate Richard Thaler of the University of Chicago, is a book that relates in many ways to Daniel Kahneman’s Thinking, Fast and Slow (which describes work done in collaboration with Amos Tversky). If you liked that book, you will probably like this one, since it covers similar themes; this shouldn’t be surprising, as Thaler collaborated with Kahneman and Tversky for portions of his career. Misbehaving is Thaler’s personal account of his development of behavioral economics, a mix of autobiography and “research-y” topics. It describes how economics has faced internal conflicts between those who advocate a purely rational view of agents (referred to as “Econs” in the book) and those who incorporate elements of human psychology into their thinking, which can cause classical economic theory to fail due to irrational behavior by humans. In chapter after chapter, Thaler argues convincingly that human behavior must be considered to understand and properly predict economic activity.

  • Option B: Facing Adversity, Building Resilience, and Finding Joy (2016) is co-written by Sheryl Sandberg and Adam Grant, and for clarity is told from the perspective of Ms. Sandberg. She’s the well-known Chief Operating Officer of Facebook and the bestselling author of Lean In, which I read a few years ago. This book arose out of the sudden death of her former husband, Dave Goldberg, in 2015, and how she went through the aftermath. Option B acknowledges that, sometimes, people simply cannot have their top option, and must deal with the second best situation, or the third best, and so on. It also relates to Lean In to some extent; that book was criticized for being elitist in nature, and Option B emphasizes that many women may face roadblocks to career success and financial safety, and hence have to consider “second options.” Option B contains anecdotes from Sandberg’s experience in the years after her husband’s death, and integrates other stories (such as the famous Uruguay flight which crashed, leading survivors to resort to cannibalism) and psychological studies to investigate how people can build resilience and overcome such traumatic events. As of mid-2020, it looks like Ms. Sandberg is now engaged again, so while this doesn’t negate her pain of losing Dave Goldberg, she shows – both in the book and in person – that one can find joy again after tragedy.

  • Good Reasons for Bad Feelings: Insights from the Frontier of Evolutionary Psychiatry (2019), by Randolph M. Nesse, a professor at Arizona State University, is about psychiatry. Wikipedia provides a short intro: psychiatry is the medical specialty devoted to the diagnosis, prevention, and treatment of mental disorders. This book focuses on the evolutionary aspect of psychiatry. A key takeaway from the book is that humans did not evolve to have mental illness or disorders. Dr. Nesse has an abbreviation for the contrary view: Viewing Diseases As Adaptations (VDAA), which he claims is the most common and serious mistake in evolutionary medicine. The correct question is, instead, why did natural selection shape traits that make us vulnerable to disease? There are intuitive explanations. For one, any personality trait exhibits itself across a spectrum of extremity. Some anxiety is necessary to help protect against harm, but having too much can be a classic sign of a mental disorder. Also, what was best for our ancestors is not best today, as vividly demonstrated by the surge in obesity in developed countries. Another takeaway, one that I probably should have expected, is that the science of psychiatry has had plenty of controversy. Consider the evolutionary benefits of homosexuality (if any). Dr. Nesse says it’s a common question he gets, and he avoids answering because he doesn’t think the science is settled. From my non-specialist perspective, this book was a readable introduction to evolutionary psychiatry.

Group 8: Miscellaneous

  • ** The Conscience of a Liberal ** (2007; updated foreword, 2009) is a book by the well-known economist and NYTimes columnist Paul Krugman. The title is similar to that of Barry Goldwater’s 1960 book, and of course, the 2017 version from former Senator Jeff Flake (which I read). In The Conscience of a Liberal, Krugman describes why he is a liberal, discusses the rise of modern “movement” Conservatism, and argues that a Democratic presidential administration must prioritize universal health care. The book was written in 2007, so he couldn’t have known that Obama would win in 2008 and pursue Obamacare, and I know from reading Krugman’s columns over the years that he’s very pro-Obamacare. Many of Krugman’s columns today at the NYTimes reflect the writing in this book. That’s not to say the ideas are stale — much of that is due to the slow pace of government, which takes ages to make progress on any issue, such as the still-unseen universal health care. Krugman consistently argues in the book (as in his columns) for a public option in addition to a strong private sector, rather than true socialized medicine of the kind Britain uses. Regarding Conservatism, Krugman gets a lot right here: he essentially predicts correctly that Republicans can’t just get rid of Obamacare due to the huge backlash, just as Eisenhower-era Republicans couldn’t get rid of the New Deal. I also think he’s right on race, in that the Republicans have forged an alliance between the wealthy pro-business, low-tax elite and the white working class, a bond which is even stronger today under Trump. My one qualm is his surprising discounting of abortion as a political issue. It’s very strong in unifying the Republican party, and perhaps he’d change that in a modern edition.

  • ** Steve Jobs ** (2011) by acclaimed writer Walter Isaacson is the definitive biography of Steve Jobs. Described as a classic “wartime CEO” by Ben Horowitz in The Hard Thing About Hard Things, Jobs co-founded Apple with Steve Wozniak, but by 1985, Jobs was forced to leave in the wake of internal disagreements. Then, after some time at another startup and at Pixar, Jobs returned to Apple in 1997 when it was on the verge of bankruptcy; by the 2010s, Apple was on its way to being the most valuable company in the world and the first to hit $1 trillion in market capitalization. While writing the biography, Isaacson had access to Steve Jobs, his family, friends, and enemies. In fact, Isaacson had explicit approval from Jobs, who asked him to write the book on the basis of Isaacson’s prior biographies of Benjamin Franklin, Albert Einstein, and others. I am not sure if Jobs ever read this book, since he passed away from cancer only a few months after it was published. The book is a mammoth 550-page volume, but it reads very quickly, and I often found myself wishing I could read more and more – Isaacson has a gift for tracing the life of Jobs, his upsides and downsides, and his interactions with people as CEO. There’s also a fair amount about the business aspects of Apple that made me better understand how things work. I can see why people consider it recommended reading for MBAs. I wonder, and hope, whether there are ways to achieve his business success and talents without the downsides: angry outbursts, super-long work hours, demands for control, and denying reality while imposing unrealistic expectations (his “reality distortion field”). I would be curious to see how his style contrasts with that of other CEOs.

  • The Only Investment Guide You’ll Ever Need by Andrew Tobias is a book with a bad title but reasonably good content. It was first written in 1978 and has been continually updated over the years; the most recent version, which I read, is the 2016 edition. As I prepare to move beyond my graduate student days, I should use my higher salary to invest more. Why? With proper investment, the rate of return should be higher than if I let the money sit in a savings account accumulating interest. Of course, that depends on investing wisely. The first part of the book has advice broadly applicable to everyone: how to save money in incremental ways that add up over time. While advice such as buying your own coffee instead of going to Starbucks and living slightly below your means sounds boring and obvious, it’s important to get these basics out of the way. The second part dives more into investing in stocks, and covers concepts that are more foreign to me. My biggest takeaway is that one should avoid commission fees that add up, and that while it’s difficult to predict stocks, in the long run, investing in stocks generally pays off. This book, being a guide, is not necessarily meant to be read front-to-back, but is one I should return to every now and then for an opinion on an investing-related topic.

  • Nasty Women: Feminism, Resistance, and Revolution in Trump’s America (2017) is a series of about 20 essays by a diverse set of women, representing different races, religions, disabilities, sexual orientations, jobs, geographic locations, and various other qualities. It was written shortly after Trump’s election, and these women unanimously oppose him. It was helpful to understand the experiences of these women, and how they felt threatened by someone who bragged about sexual assault and has some retrograde views on women. There was clear disappointment from these women towards the “53% of white women who voted for Trump,” a statistic repeated countless times in Nasty Women. On the issue of race, some of the Black women writers felt conflicted about attending the Women’s March, given that the original idea for these marches came from Black women. I agree with the criticism of these writers towards some liberal men, who may have strongly supported Bernie Sanders but had trouble supporting Clinton. For me, it was actually the reverse; I voted for Clinton over Sanders in the primaries. That said, I don’t agree with everything. For example, one author criticized the notion of Sarah Palin calling herself a feminist, and said that we need a different definition of feminism that doesn’t include someone like Palin. I think women have a wide range of beliefs, and we shouldn’t design feminism to leave Conservative women out of the umbrella. Nonetheless, there’s a lot of agreement between me and these authors.

  • The Hot Hand: The Mystery and Science of Streaks (2018) is by WSJ reporter Ben Cohen, who specializes in covering the NBA, NCAA, and other sports. “The hot hand” refers to a streak in anything. Cohen goes over the obvious: Stephen Curry is the best three-point shooter in the history of basketball, and he can get on a hot streak. But, is there a scientific basis to this? Is there actually a hot hand, or does Curry just happen to hit his usual rate of shots, except that due to the nature of randomness, he sometimes has streaks? Besides shooting, Cohen reviews streaks in areas such as music, plays, academia, business, and Hollywood. From the first few chapters, it seems like most academics don’t think there is a hot hand, whereas people who actually perform the tasks (e.g., athletes) might think otherwise. The academics include Amos Tversky and Daniel Kahneman, the two famous Israeli psychologists who revolutionized their field. However, by the time we get to the last chapter, Cohen points out two things that were somehow missed in most earlier discussions of the hot hand. First, basketball shots and similar events are not independent and identically distributed; controlling for the harder shot selection that people who believe they have “the hot hand” take, they actually overperform relative to expectations. The second is slightly more involved but has to do with sequences of heads and tails that have profound implications for interpreting the hot hand. In fact, you can see a discussion on Andrew Gelman’s famous blog. So, is there a hot hand? The book leaves the question open, which I expected, since a vague concept like this probably can’t be definitively proved or disproved. Overall, it’s a decent book. My main criticism is that some of the anecdotes (e.g., the search for a Swedish man in a Soviet prison and the Vincent van Gogh painting) don’t mesh well with the book’s theme.

  • How to Do Nothing: Resisting the Attention Economy (2019) by artist and writer Jenny Odell is a manifesto about moving focus away from the “attention economy” as embodied by Facebook, Twitter, and other social media and websites which rely on click-throughs and advertisements for revenue. She wrote this after the Trump election, since (a) she’s a critic of Trump, and (b) Trump’s constant use of Twitter and other attention-grabbing comments have turned the country into a constant 24-hour news cycle. Odell cautions against treating “digital detox” as a solution, and reviews the history of several such digital detox or “utopia” experiments that failed to pan out. The book isn’t the biggest page-turner but is still thought-provoking. However, I am not sure about her proposed tactics for “how to do nothing,” except perhaps to focus on nature more. She supports preserving nature, along with people who protested the development of condos over preserved land, but this would continue to exacerbate the Bay Area’s existing housing crisis. I see the logic, but I can’t oppose more building. I do agree with reducing the need for attention, and while I use social media and support its usage, I agree there are limits to it.

  • Inclusify: The Power of Uniqueness and Belonging to Build Innovative Teams (2020) is a recent book by Stefanie K. Johnson, a professor at the University of Colorado Boulder’s Leeds School of Business who studies leadership and diversity. Dr. Johnson defines inclusify as “to live and lead in a way that recognizes and celebrates unique and dissenting perspectives while creating a collaborative and open-minded environment where everyone feels they truly belong.” She argues it helps increase sales, drives innovation, and reduces turnover, and the book is her attempt at distilling these lessons about improving diversity efforts at companies. She identifies six types of people who might be missing out on the benefits of inclusification: the meritocracy manager, the culture crusader, the team player, the white knight, the shepherd, and the optimist. I will need to keep these groups in mind to make sure I do not fall into these categories. Although I agree with the book’s claims, I’m not sure how much I benefited from reading Inclusify, given that I read it after several other books this year that covered similar ground (e.g., many “anti-racist” books discuss these topics). I published this blog post a few months after reading the book, and I confess that I remember less about its contents compared to other books.

  • Master of None: How a Jack-of-All-Trades Can Still Reach the Top (2020) is by Clifford Hudson, the former CEO of Sonic Drive-In, a fast food restaurant chain (see this NYTimes profile for context). This is an autobiography in which Hudson pushes back against the notion that to live an accomplished life, one needs to master a particular skill, as popularized in books such as Malcolm Gladwell’s Outliers with its “10,000-Hour Rule”. Hudson argues that his life has been fulfilling despite never deliberately mastering one skill. The world is constantly changing, so it is necessary to quickly adapt, to say “yes” to opportunities that arise, and to properly delegate tasks to others who know better. I think Hudson himself serves as evidence for not necessarily needing to master one skill, but the book seems well tailored for folks working in business, and I would be curious to see a discussion in an academic context, where the system is built to encourage us to specialize in one field. It’s a reasonably good autobiography and a fast read. I would not call it super great or memorable. I may read David Epstein’s book Range: Why Generalists Triumph in a Specialized World to follow up on this topic.

Well, that is it for 2020.


  1. Kurzweil predicts that “we will encounter such a non-biological entity” by 2029 and that this will “become routine in the 2030s.” OK, let me revisit that in a decade! 

  2. As far as I know, “polygenic scores” require taking a bunch of DNA samples and predicting outcomes, while CRISPR can actually do the editing of that DNA to lead to such outcomes. I’d be curious if any biochemists or psychologists could chime in to correct my understanding. 

  3. Dana MacKenzie has an interesting story about being denied tenure at Kenyon College (which he taught after leaving Duke, when it was clear he would also not get tenure there). You can find it on his website. There is also a backstory on how he and Judea Pearl got together to write the book. 

  4. Personally, I first found out about Hans Rosling through a Berkeley colleague’s research on data visualization. 

  5. I didn’t realize that Martínez knew David Kauchak during their Adchemy days. I briefly collaborated with Kauchak during my undergraduate research. 

  6. I somehow did not know about Lex Fridman’s AI podcast. If my book reading list feels shallower this year, then I blame his podcast for all those thrilling videos with pioneers of AI and related fields. 

  7. To state the obvious, I have never been to one of these parties. Chang says: “the vast majority of people in Silicon Valley have no idea these kinds of sex parties are happening at all. If you’re reading this and shaking your head […] you may not be a rich and edgy male founder or investor, or a female tech in her twenties.” 

  8. I rarely post status updates on my Facebook anymore, but a few days before her Op-Ed, I posted a graphic I created with pictures of leaders along with their country’s COVID-19 death count and deaths as a fraction of population. And, yes, my self-selected countries led by female leaders have done a reasonable job controlling the outbreak. I’m most impressed with Tsai Ing-wen of Taiwan, who had to handle this while (a) being geographically close to China itself, and (b) largely ostracized by the wider international community. For an example of the second point, look at how a WHO official dodged a question about Taiwan and COVID-19. 

  9. If you’re curious, Desmond has a postscript at the end of the book explaining how he did this research project, including when he felt like he needed to intervene, and how the tenants treated him. It’s fascinating, and I wish this section of the book were much longer, but I understand if Desmond did not want to raise too much attention to himself. In addition, there is a lot of data in the footnotes. I read all the footnotes, and recommend reading them even if it comes at the cost of some “reading discontinuity.” 

  10. When he ran for his state political office, he and a small group of campaigners went door-to-door and contacted people face-to-face. I don’t know how this would scale to larger cities or work in the age of COVID-19. Incidentally, there isn’t any discussion on COVID-19, but I suspect if Sellers had written the book just a few months later, he would discuss the pandemic’s disparate impact on Blacks. 

  11. I do not feel like I know enough about the Iran Nuclear Deal to give a qualified statement. I was probably a lukewarm supporter of it, but since the deal no longer appears to be active as of January 2020, I am in favor of a stronger deal (as in, one that can get U.S. congressional approval) if that is at all possible. 

  12. The ruling army junta (i.e., a government led by the military) changed the English name of the country from Burma to Myanmar in 1989. 

  13. She isn’t the only terrible recipient of the Nobel Peace Prize. Reading the list of past recipients sometimes feels like going through one nightmare after the other. 


Mechanical Search in Robotics

Dec 27, 2020

One reason why I enjoy working on robotics is that many of the problems the research community explores are variants of tasks we humans do on a daily basis. For example, consider the problem of searching for and retrieving a target object in clutter. We do this all the time. We might have a drawer of kitchen appliances, and may want to pick out a specific pot for cooking food. Or, maybe we have a box filled with a variety of facial masks, and we want to pick the one to wear today when venturing outside (something perhaps quite common these days). In the robotics community, researchers I collaborate with have recently formulated this as the mechanical search problem.

In this blog post, I discuss four recent research papers on mechanical search, split up into two parts. The first two focus on core mechanical search topics, and the latter two propose using something called learned occupancy distributions. Collectively, these papers have appeared at ICRA 2019 and IROS 2020 (twice), and one of these is an ICRA 2021 submission.

Mechanical Search and Visuomotor Mechanical Search

The ICRA 2019 paper formalizes mechanical search as the task of retrieving a specific target object from an environment containing a variety of objects within a time limit. They frame the general problem using the Markov Decision Process (MDP) framework, with the usual states, actions, transitions, rewards, and so on. They consider a specific instantiation of the mechanical search MDP as follows:

  • They consider heaps of 10-20 objects at the start.
  • The target object to extract is specified by a set of $k$ overhead RGB images.
  • The observations at each time step (which a policy would consume as input) are RGB-D, where the extra depth component can enable better segmentation.
  • The methods they use do not use any reward signal.
  • They enable three action primitives: (a) push, (b) suction, and (c) grasp.

The push action is there so that the robot can rearrange the scene for better suction and grasp actions, which are the primitives that actually enable the robot to retrieve the target object (or distractor objects, for that matter). While more complex action primitives might be useful for mechanical search, this would introduce complexities due to the curse of dimensionality.

Here’s the helpful overview figure from the paper (with the caption) showing their instantiation of mechanical search:


I like these types of figures, which are standard for papers we write in Ken Goldberg’s lab.

The pipeline is split into a perception stage and a search policy stage. The perception stage first computes a set of object masks from the input RGB-D observation. It then uses a trained Siamese Network to check the “similarity” between each of these masks and the target images. (Remember, in their formulation, we assume $k$ separate images that specify the target, so we can feed all combinations of each target image with each of the computed masks.) If a mask matching the target is found, they can run the search policy to select one of the three allowed action primitives, choosing the one with the highest “score.” How is this value chosen? We can use off-the-shelf Dex-Net policies to compute the probability of action successes. Please refer to my earlier blog post here about Dex-Net.
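To make the perception stage concrete, here is a minimal sketch of matching computed object masks against the $k$ target images via cosine similarity of embeddings. The `embed` function stands in for the trained Siamese Network's embedding branch, and the threshold value is illustrative, not from the paper:

```python
import numpy as np

def siamese_similarity(embed, crop_a, crop_b):
    """Cosine similarity between the embeddings of two image crops."""
    za, zb = embed(crop_a), embed(crop_b)
    return float(np.dot(za, zb) / (np.linalg.norm(za) * np.linalg.norm(zb)))

def find_target(embed, mask_crops, target_crops, threshold=0.8):
    """Compare every computed object mask against every target view;
    return the index of the best-matching mask, or None if no pair
    clears the (illustrative) similarity threshold."""
    best_idx, best_score = None, threshold
    for i, mask in enumerate(mask_crops):
        for target in target_crops:
            score = siamese_similarity(embed, mask, target)
            if score >= best_score:
                best_idx, best_score = i, score
    return best_idx
```

In the real system the embedding comes from the trained network; here any feature extractor that maps a crop to a vector would slot into `embed`.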

Here are a couple of things that might not be clear upon a first read of the paper:

  • There’s a difference between how action qualities are computed in simulation versus real. In simulation, grasp and suction actions both use indexed grasps from a simulated Dex-Net 1.0 policy, which is easy to use as it avoids having to run segmentation. In addition, Dex-Net 1.0 literally contains a dataset of simulated objects plus successful grasps for each object, so we can cycle through those as needed.
  • In real, however, we don’t have easy access to this information. Fortunately, for grasp and suction actions, we have ready-made policies from Dex-Net 2.0 and Dex-Net 3.0, respectively. We could use them in simulation as well, it’s just not necessary.

To be clear, this is how to compute the action quality. But there’s a hierarchy: we need an action selector that can use the computed object masks (from the perception stage) to decide which object we want to grasp using the lower-level action primitives. This is where their 5 algorithmic policies come into play, which correspond to “Action Selector” in the figure above. They test with random search, prioritizing the target object (with and without pushing), and a largest first variant (again, with and without pushing).
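As a rough illustration of the “largest first” idea (not the paper’s exact implementation), an action selector might retrieve the target when one of the masks matched it, and otherwise act on the largest visible mask, since large objects are the most likely occluders:

```python
import numpy as np

def largest_first(masks, target_idx):
    """'Largest-first' selector sketch: retrieve the target if a
    segmented mask matched it; otherwise remove the largest visible
    object, the most likely occluder. masks: binary 2D arrays."""
    if target_idx is not None:
        return ("retrieve", target_idx)
    areas = [int(np.asarray(m).sum()) for m in masks]  # pixel area per mask
    return ("remove", int(np.argmax(areas)))
```

The chosen object index would then be handed to the lower-level push, suction, or grasp primitive with the highest Dex-Net quality score.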

The experiments show that, as expected, algorithmic policies that prioritize the target object and the larger objects (if the target is not visible) are better. However, a reader might argue that from looking closely at the figures in the paper, the difference in performance among the 4 algorithmic policies other than the random policy may be minor.

That being said, as a paper that introduces the mechanical search problem, they have a mandate to test the simplest types of policies possible. The conclusion correctly points out that an interesting avenue for future work is to do reinforcement learning. Did they do that?

Yes! This is good news for those of us who like to see research progress, and bad news for those who were trying to beat the authors to it. That’s the purpose of their follow-up IROS 2020 paper, Visuomotor Mechanical Search. It fills the obvious gap left by the ICRA 2019 paper: that performance is limited by algorithmic policies, which are furthermore restricted to linear pushes parameterized by an initial point and a push direction. Properly-trained learning-based policies that can perform continuous pushing strategies should generalize better to complex configurations than algorithmic ones.

Since naively applying Deep RL is very sample inefficient, the paper proposes an approach combining three components:

  • Demonstrations. It’s well-known that demonstrations are helpful in mitigating exploration issues, a topic I have previously explored on this blog.
  • Asymmetric Information. This is a fancy way of saying that during training, the agent can use information that is not available at test time. This can be done when using simulators (as in my own work, for example) since the simulator includes detailed information such as ground-truth object positions which are not easily accessible from just looking at an image.
  • Mid-Level Representations. This means providing the policy (i.e., actor) not the raw RGB image, but something “mid-level.” Here, “mid-level” means the segmentation mask of the target object, plus camera extrinsics and intrinsics. These are what actually get passed as input to the mechanical search policy, and the logic for this is that the full RGB image would be needlessly complex. It is better to just isolate the target object. Note that the full depth image is passed as input — the mid-level representation just replaces the RGB component.
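A minimal sketch of assembling such a policy input, assuming the target segmentation mask and depth image are already available (camera intrinsics and extrinsics, also part of the representation, are omitted here):

```python
import numpy as np

def midlevel_observation(depth, target_mask):
    """Stack the full depth image with the target object's binary
    segmentation mask; the raw RGB image is deliberately excluded,
    since the mask isolates what the policy needs from it."""
    return np.stack([depth.astype(np.float32),
                     target_mask.astype(np.float32)], axis=-1)
```

In simulation the mask comes for free from ground-truth segmentation; on a real robot it would have to come from a trained segmentation model.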

In the MDP formulation for visuomotor mechanical search, observations are RGB-D images and the robot’s end-effector pose, actions are relative end-effector changes, and the reward is shaped and hand-tuned to encourage the agent to make the target object visible. While I have some concerns about shaping rewards in general, it seems to have worked for them. While the actor policy takes in the full depth image, it simultaneously consumes the mid-level representation of the RGB observation. In simulation, one can derive the mid-level representation from ground-truth segmentation masks provided by PyBullet. They did not test on physical robots, but they claim that it should be possible to use a trained segmentation model.

Now, what about the teachers? They define three hard-coded teachers that perform pushing actions, and merge the teachers as demonstrators into the “AC-Teach” framework. This is the authors’ prior paper that they presented at CoRL 2019. I read the paper in quite some detail, and to summarize, it’s a way of performing training that can combine multiple teachers together, each of which may be suboptimal or only cover part of the state space. The teachers use privileged information by not using images but rather using positions of all objects, both the target and the non-target(s).

Then, with all this, the actor $\pi_\theta(s)$ and critic $Q_\phi(s, a)$ are updated using standard DDPG-style losses. Here is Figure 2 from the visuomotor mechanical search paper, which summarizes the previous points:


Remember that the policy executes these actions continuously, without retracting the arm after each discrete push, as done in the method from the ICRA 2019 paper.
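For reference, the DDPG-style critic update bootstraps off Bellman targets computed with the target networks; the sketch below shows only the target computation, with the network updates themselves omitted:

```python
import numpy as np

def ddpg_critic_targets(rewards, next_q, dones, gamma=0.99):
    """Bellman targets y = r + gamma * Q'(s', pi'(s')) for non-terminal
    transitions; next_q holds the target critic's values at the target
    actor's next actions. The critic Q_phi is then regressed to y with
    an MSE loss, and the actor pi_theta is updated to maximize
    Q_phi(s, pi_theta(s))."""
    return rewards + gamma * (1.0 - dones) * next_q
```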

They conduct all experiments in PyBullet simulation, and extensively test by ablating on various components. The experiments focus on either a single-heap or a dual-heap set of objects, which additionally tests if the policy can learn to ignore the “distractor” heap (i.e., the one without the target object in it) in the latter setting. The major future work plan is to address failure cases. I would also add that the authors could consider applying this on a physical robot.

These two papers give a nice overview of two flavors of mechanical search. The next two papers also relate to mechanical search, and utilize something known as learned occupancy distributions. Let’s dive in to see what that means.

X-RAY and LAX-RAY

In an IROS 2020 paper, Danielczuk and collaborators introduce the idea of X-RAY for mechanical search of occluded objects. To be clear: there was already occlusion present in the prior works, but this work explicitly considers it. X-RAY stands for maXimize Reduction in support Area of occupancY distribution. The key idea is to use X-RAY to estimate “occupancy distributions,” a fancy way of labeling each bounding box in an image with the likelihood that it contains the target object.

As with the prior works, there is an MDP formulation, but there are a few other important definitions:

  • The modal segmentation mask: regions of pixels in an image corresponding to a given target object which are visible.
  • The amodal segmentation mask: regions of pixels in an image corresponding to a given target object which are either visible or invisible. Thus, the amodal segmentation mask must contain the modal segmentation mask, as it has both the visible component, plus any invisible parts (which is where the occlusion happens).
  • Finally, the occupancy distribution $\rho \in \mathcal{P}$: the unnormalized distribution describing the likelihood that a given pixel in the observation image contains some part of the target object’s amodal segmentation mask.

This enables them to utilize the following reward function to replace a sparse reward:

\[\tilde{R}(\mathbf{y}_k, \mathbf{y}_{k+1}) = |{\rm supp}(f_\rho(\mathbf{y}_{k}))| - |{\rm supp}(f_\rho(\mathbf{y}_{k+1}))|\]

where \(f_\rho\) is a function that takes in an observation \(\mathbf{y}_{k}\) (following the paper’s notation) and produces the occupancy distribution \(\rho_k\) for a given bounding box, and where \(|{\rm supp}(\rho)|\) for a given support \(\rho\) (dropping the $k$ subscript for now) is the number of nonzero pixels in \(\rho\).

Why is this logical? By reducing the occupancy distribution’s support, one decreases the number of pixels that MIGHT occlude the target object, hence reducing uncertainty. Said another way, increasing this reward gives us greater certainty as to where the target object is located, which is an obvious prerequisite for mechanical search.
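The reward is simple to compute once the occupancy distributions are estimated; a sketch, treating each estimated distribution as a 2D array:

```python
import numpy as np

def support_size(rho):
    """|supp(rho)|: the number of nonzero pixels in an occupancy
    distribution."""
    return int(np.count_nonzero(np.asarray(rho)))

def xray_reward(rho_k, rho_k1):
    """Shaped reward R~: reduction in support area between consecutive
    occupancy distribution estimates, positive when uncertainty about
    the target's location shrinks."""
    return support_size(rho_k) - support_size(rho_k1)
```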

The paper then describes (a) how to estimate $f_\rho$ in a data-driven manner, and then (b) how to use this learned $f_\rho$, along with $\tilde{R}$, to define a greedy policy.

There’s an elaborate pipeline for generating the training data. Originally I was confused about their procedure for translating the target object. But after reading carefully and watching the supplementary video, I understand; it involves simulating a translation and rotation while keeping objects fixed. Basically, they pretend they can repeatedly insert the target object at specific locations underneath a pile of distractor objects, and if it results in the same occupancy distribution, then they can include such images in the data to expand the occupancy distribution to its maximum possible area (by aggregating all the amodal maps), meaning that estimates of the occupancy distribution are a lower bound on the area.

As expected, they train using a Fully Convolutional Network (FCN) with a pixel-wise MSE loss. You can think of this loss as taking the target image and the image produced from the FCN, unrolling them into long vectors \(\mathbf{x}_{\rm targ}\) and \(\mathbf{x}_{\rm pred}\), then computing

\[\|\mathbf{x}_{\rm targ} - \mathbf{x}_{\rm pred}\|_2^2\]

to find the loss. This glosses over a tiny detail: the network actually predicts occupancy distributions for different aspect ratios (one per channel in the output image), and only the channel whose aspect ratio matches the input gets considered for the loss. Not a huge deal if you’re skimming the paper: it probably suffices to realize that it’s the standard MSE.
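A sketch of this loss, assuming `pred_channels` is a list of per-aspect-ratio output maps and `matching_channel` indexes the one whose aspect ratio matches the input:

```python
import numpy as np

def pixelwise_loss(pred_channels, target, matching_channel):
    """Squared-L2 loss ||x_targ - x_pred||^2, computed only on the
    output channel whose aspect ratio matches the input; the other
    aspect-ratio channels are ignored."""
    pred = np.asarray(pred_channels[matching_channel]).ravel()
    targ = np.asarray(target).ravel()
    return float(np.sum((targ - pred) ** 2))
```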

Here is the paper’s key overview figure:


They propose to plan a grasp at the location with the greatest corresponding occupancy area. Why? A pick and place at that spot will most reduce the subsequent occupancy area of the target object.

It is nice that these FCNs can reasonably predict occupancy distributions for target objects unseen in training, and that they can generalize to the physical world without actually training on physical images. Training on real images would be harder, since depth images would likely be noisier.

The two future directions they propose are: relaxing the assumption that the target object is flat, and (again) applying reinforcement learning. This paper was concurrent with the visuomotor mechanical search paper, but that paper did not technically employ X-RAY, so I suppose there is room to merge the two.

Next, what about the follow-up work of LAX-RAY? This addresses an obvious extension in that instead of top-down grasping, one can do lateral grasping, where the robot arm moves horizontally instead of vertically. This enables application to shelves. Here’s the figure summarizing the idea:


We can see that a Fetch robot has to reveal something deep in the shelf by pushing the objects in front to either the left or the right. The robot has a long thin board attached to its gripper, rather than the usual Fetch gripper. The task ends as soon as the target object, known beforehand, is revealed.

As with standard X-RAY, the method involves using a Fully Convolutional Network (FCN) to map from an image of the shelf to a distribution of where the target object could be. (Note: the first version of the arXiv paper says “fully connected” but I confirmed with the authors that it is indeed an FCN, which is a different term.) This produces a 2D image. Unlike X-RAY, LAX-RAY maps this 2D occupancy distribution to a 1D occupancy distribution. The paper visualizes these 1D occupancy distributions by overlaying them on depth images. The math for getting the 1D distribution is fairly straightforward: just consider every “vertical bar” in the image as one point in the distribution, then sum over the values from the 2D occupancy distribution. That’s how I visualize it.
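That column-summing step is simple enough to sketch directly (a toy version with Python lists, not the authors’ code):

```python
def occupancy_2d_to_1d(occ_2d):
    """Collapse a 2D occupancy grid (list of rows) into a 1D distribution
    by summing each column, i.e., each "vertical bar" of the image."""
    n_cols = len(occ_2d[0])
    return [sum(row[c] for row in occ_2d) for c in range(n_cols)]
```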

The paper proposes three policies for lateral-access mechanical search:

  • Distribution Area Reduction (DAR): ranks actions based on overlap between the object mask and the predicted occupancy distribution, and picks the action that reduces the sum the most. This policy is the most similar, in theory, to the X-RAY policy: essentially we’re trying to “remove” the occupancy distribution to reduce areas where the object might be occluded.
  • Distribution Entropy Reduction over n Steps (DER-n): this tries to predict what the 1D occupancy distribution will look like over $n$ steps, and then picks the action whose predicted distribution has the lowest entropy. Why does this make sense? Because lower entropy means the distribution is less spread out and more concentrated in one area, telling us where the occluded item is located. The authors also introduce this so that they can test multi-step planning.
  • Uniform: this tests a DAR ablation by removing the predicted occupancy distribution.
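To make the DAR and DER-n scoring concrete, here is a hedged sketch with toy 1D distributions; the real policies score predicted post-action distributions, and every function name here is my own:

```python
import math

def dar_score(occ_1d, mask_1d):
    """Distribution Area Reduction: overlap between an object mask and
    the occupancy distribution that an action would remove. DAR picks
    the action with the highest such overlap."""
    return sum(o * m for o, m in zip(occ_1d, mask_1d))

def entropy(occ_1d):
    """Entropy of a 1D occupancy distribution (normalized internally);
    lower entropy means the distribution is more concentrated, so DER-n
    prefers action sequences whose predicted distribution minimizes it."""
    total = sum(occ_1d)
    probs = [v / total for v in occ_1d if v > 0]
    return -sum(p * math.log(p) for p in probs)
```

A fully concentrated distribution like `[1, 0, 0]` has zero entropy, which matches the intuition that we then know exactly where the occluded object is.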

They also introduce a First-Order Shelf Simulator (FOSS), a simulator they use for fast prototyping, before experimenting with the physical Fetch robot.

What are some of my thoughts on how they can build upon this work? Here are a few:

  • They can focus on grasping the object. Right now the objective is only to reveal the object, but there’s no actual robot grasp execution. Suctioning in a lateral direction might require more sensitive controls to avoid pushing the object too much, as compared to top-down where gravity stops the target object from moving away.
  • The setup might be a bit constrained in that it assumes stuff can be pushed around. For example, consider a vase with water and flowers. Those might be hard to push and are at risk of toppling.

Parting Thoughts

To summarize, here is how I view these four papers grouped together:

  • Paper 1: introduces and formalizes mechanical search, and presents a study of 5 algorithmic (i.e., not learned) policies.
  • Paper 2: extends mechanical search to use AC-Teach for training a learned policy that can execute actions continually.
  • Paper 3: combines mechanical search with “occupancy distributions,” with the intuition being that we want the robot to check the most likely places where an occluded object could be located.
  • Paper 4: extends the prior paper to handle lateral access scenarios, as in shelves.

What are some other thoughts and takeaways I have?

  • It would be exciting to see this capability mounted onto a mobile robot, like the HSR that we used for our bed-making paper. (We also used a Fetch, and I know the LAX-RAY paper uses a Fetch, but the Fetch’s base stayed put during LAX-RAY experiments.) Obviously, this would not be novel from a research perspective, so something new would have to be added, such as adjustments to the method to handle imprecision due to mobility.
  • It would be nice to see if we can make these apply for deformable bags, i.e., replace the bins with bags, and see what happens. I showed that we can at least simulate bagging items in PyBullet in some concurrent work.
  • There’s also a fifth mechanical search paper, on hierarchical mechanical search, also under review for ICRA 2021. I only had time to skim it briefly and did not realize it existed until after I had drafted the majority of this blog post. I have added it in the reference list below.

References










DAgger Versus SafeDAgger

Nov 7, 2020

The seminal DAgger paper from AISTATS 2011 has had a tremendous impact on machine learning, imitation learning, and robotics. In contrast to the vanilla supervised learning approach to imitation learning, DAgger proposes to use a supervisor to provide corrective labels to counter compounding errors. Part of this BAIR Blog post has a high-level overview of the issues surrounding compounding errors (or “covariate shift”), and describes DAgger as an on-policy approach to imitation learning. DAgger itself — short for Dataset Aggregation — is super simple and looks like this:

  • Train \(\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)\) from demonstrator data \(\mathcal{D} = \{\mathbf{o}_1, \mathbf{a}_1, \ldots, \mathbf{o}_N, \mathbf{a}_N\}\).
  • Run \(\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)\) to get an on-policy dataset \(\mathcal{D}_\pi = \{\mathbf{o}_1, \ldots, \mathbf{o}_M\}\).
  • Ask a demonstrator to label $\mathcal{D}_\pi$ with actions $\mathbf{a}_t$.
  • Aggregate $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\pi}$ and train again.

with the notation borrowed from Berkeley’s DeepRL course. The training step is usually done via standard supervised learning. The original DAgger paper includes a hyperparameter $\beta$ so that the on-policy data is actually generated with a mixture:

\[\pi = \beta \pi_{\rm supervisor} + (1-\beta) \pi_{\rm agent}\]

but in practice I set $\beta=0$, which in this case means all states are generated from the learner agent, and then subsequently labeled from the supervisor.
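The four steps above can be sketched as a short loop. This is a toy rendering, not the authors’ code: `train`, `supervisor`, and the `env_*` callables are stand-ins, and for brevity the aggregated dataset starts empty here rather than seeded with demonstrator data:

```python
import random

def dagger(env_reset, env_step, supervisor, train, n_iters, horizon, beta=0.0):
    """Toy DAgger loop with the beta mixture included; beta=0 means the
    learner generates all states and the supervisor only labels them."""
    data = []  # aggregated (observation, supervisor action) pairs
    policy = train(data)
    for _ in range(n_iters):
        obs = env_reset()
        for _ in range(horizon):
            # Mixture policy: with probability beta, follow the supervisor.
            act = supervisor(obs) if random.random() < beta else policy(obs)
            data.append((obs, supervisor(obs)))  # supervisor labels every state
            obs = env_step(obs, act)
        policy = train(data)  # retrain on the aggregated dataset
    return policy
```

The key structural point is that labels always come from the supervisor, while (with $\beta = 0$) the visited states come from rolling out the learner.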

DAgger is attractive not only in practice but also in terms of theory. The analysis of DAgger relies on mathematical ingredients from regret analysis and online learning, as hinted by the paper title: “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” You can find some relevant theory in (Kakade and Tewari, NeurIPS 2009).

The Dark Side of DAgger

Now that I have started getting used to reading and reviewing papers in my field, I can more easily understand tradeoffs in algorithms. So, while DAgger is a conceptually simple and effective method, what are its downsides?

  • We have to request the supervisor for labels.
  • This has to be done for each state the agent encounters when taking steps in an environment.

Practitioners can mitigate these by using a simulated demonstrator, as I have done in some of my robot fabric manipulation work. In fact, I’m guessing this is the norm in machine learning research papers that use DAgger. This is not always feasible, however, and even with a simulated demonstrator, there are advantages to querying less often.

Keeping within the DAgger framework, an obvious solution would be to only request labels for a subset of data points. That’s precisely what the SafeDAgger algorithm, proposed by Zhang and Cho, and presented at AAAI 2017, intends to accomplish. Thus, let’s understand how SafeDAgger works. In the subsequent discussion, I will (generally) use the notation from the SafeDAgger paper.

SafeDAgger

The SafeDAgger paper has a nice high-level summary:

In this paper, we propose a query-efficient extension of the DAgger, called SafeDAgger. We first introduce a safety policy that learns to predict the error made by a primary policy without querying a reference policy. This safety policy is incorporated into the DAgger’s iterations in order to select only a small subset of training examples that are collected by a primary policy. This subset selection significantly reduces the number of queries to a reference policy.

Here is the algorithm:


SafeDAgger uses a primary policy $\pi$ and a reference policy \(\pi^*\), and introduces a third policy $\pi_{\rm safe}$, known as the safety policy, which takes in the observation of the state $\phi(s)$ and must determine whether the primary policy $\pi$ is likely to deviate from a reference policy \(\pi^*\) at $\phi(s)$.

A quick side note: I often treat “states” $s$ and “observations” $\phi(s)$ (or $\mathbf{o}$ in my preferred notation) interchangeably, but keep in mind that these technically refer to different concepts. The “reference” policy is also often referred to as a “supervisor,” “demonstrator,” “expert,” or “teacher.”

A very important fact, which the paper (to its credit) repeatedly accentuates, is that because $\pi_{\rm safe}$ is called at each time step to determine if the reference must be queried, $\pi_{\rm safe}$ cannot itself query \(\pi^*\). Otherwise, there’s no benefit — one might as well dispense with $\pi_{\rm safe}$ altogether and query \(\pi^*\) normally for all data points.

The deviation $\epsilon$ is defined with the squared $L_2$ distance:

\[\epsilon(\pi, \pi^*, \phi(s)) = \| \pi(\phi(s)) - \pi^*(\phi(s)) \|_2^2\]

since actions in this case are in continuous land. The optimal safety policy $\pi_{\rm safe}^*$ is:

\[\pi_{\rm safe}^*(\pi, \phi(s)) = \begin{cases} 0, \quad \mbox{if}\; \epsilon(\pi, \pi^*, \phi(s)) > \tau \\ 1, \quad \mbox{otherwise} \end{cases}\]

where the cutoff $\tau$ is user-determined.
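These two definitions translate directly into toy Python (actions as plain lists; function names are mine):

```python
def deviation(act_primary, act_reference):
    """Squared L2 distance between the two policies' actions."""
    return sum((a - b) ** 2 for a, b in zip(act_primary, act_reference))

def safe_label(act_primary, act_reference, tau):
    """Optimal safety policy's output: 0 if the primary deviates from the
    reference by more than tau (unsafe), 1 otherwise (safe)."""
    return 0 if deviation(act_primary, act_reference) > tau else 1
```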

The real question now is how to train $\pi_{\rm safe}$ from data \(D = \{ \phi(s)_1, \ldots, \phi(s)_N \}\). The training uses the binary cross entropy loss, where the label answers the question: are the two policies taking sufficiently different actions? For a given dataset $D$, the loss is:

\[\begin{align} l_{\rm safe}(\pi_{\rm safe}, \pi, \pi^*, D) &= - \frac{1}{N} \sum_{n=1}^{N} \pi_{\rm safe}^*(\phi(s)_n) \log \pi_{\rm safe}(\phi(s)_n, \pi) + \\ & (1 - \pi_{\rm safe}^*(\phi(s)_n)) \log(1 - \pi_{\rm safe}(\phi(s)_n, \pi)) \end{align}\]

again, here, \(\pi_{\rm safe}^*\) and \((1-\pi_{\rm safe}^*)\) represent ground-truth labels for the cross entropy loss. It’s a bit tricky; the label isn’t something inherent in the training data, but something SafeDAgger artificially enforces to get the desired behavior.
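For concreteness, here is the binary cross entropy part as a toy function, assuming we already have the thresholded labels and the safety policy’s predicted probabilities of being “safe” (names are mine, not the paper’s):

```python
import math

def l_safe(pred_probs, labels):
    """Binary cross entropy averaged over the dataset: pred_probs are
    pi_safe's predicted probabilities, labels are the 0/1 thresholded
    deviation labels playing the role of pi_safe^*."""
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(pred_probs, labels)) / n
```

As expected for cross entropy, the loss shrinks as the predicted probability approaches the artificial label.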

Now let’s discuss the control flow of SafeDAgger. The agent collects data by following a safety strategy. Here’s how it works: at every time step, if $\pi_{\rm safe}(\pi, \phi(s)) = 1$, let the usual agent take actions. Otherwise, $\pi_{\rm safe}(\pi, \phi(s)) = 0$ (remember, this function is binary) and the reference policy takes actions. Since this is done at each time step, the reference policy can return control to the agent as soon as it is back into a “safe” state with low action discrepancy.

Also, when the reference policy takes actions, these are the data points that get labeled to produce a subset of data $D’$ that form the input to $l_{\rm safe}$. Hence, the process of deciding which subset of states should be used to query the reference happens during environment interaction time, and is not a post-processing event.
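Here is how I picture that per-timestep switching, as a hedged sketch (function names are illustrative, and the real method feeds observations $\phi(s)$ to $\pi_{\rm safe}$):

```python
def rollout(obs, horizon, pi, pi_ref, pi_safe, env_step):
    """Collect data with the safety strategy: when pi_safe flags a state
    as unsafe (returns 0), the reference policy acts AND that state is
    added to the labeled subset D' used for training."""
    labeled_subset = []  # D': states where the reference took over
    for _ in range(horizon):
        if pi_safe(obs) == 1:
            act = pi(obs)                       # safe: primary policy acts
        else:
            act = pi_ref(obs)                   # unsafe: reference acts...
            labeled_subset.append((obs, act))   # ...and labels this state
        obs = env_step(obs, act)
    return labeled_subset
```

Because the check happens every step, control returns to the primary policy as soon as the state is deemed safe again, just as the paragraph above describes.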

Training happens in lines 9 and 10 of the algorithm, which updates not only the agent $\pi$, but also the safety policy $\pi_{\rm safe}$.

Actually, it’s somewhat strange that the safety policy helps at all. If you notice, the algorithm continually adds new data to existing datasets, so while $D_{\rm safe}$ initially produces a vastly different dataset for $\pi_{\rm safe}$ training, in the limit, $\pi$ and $\pi_{\rm safe}$ will be trained on the same dataset. Line 9, which trains $\pi$, will make it so that for all $\phi(s) \in D$, we have \(\pi(\phi(s)) \approx \pi^*(\phi(s))\). Then, line 10 trains $\pi_{\rm safe}$ … but if the training in the previous step worked, then the discrepancies should all be small, and hence it’s unclear why we need a threshold if we know that all observations in the data result in similar actions between \(\pi\) and \(\pi^*\). In some sense $\pi_{\rm safe}$ is learning a support constraint, but it would not be seeing any negative samples. It is somewhat of a philosophical mystery.

Experiments. The paper uses the driving simulator TORCS with a scripted demonstrator. (I have very limited experience with TORCS from an ICRA 2019 paper.)

  • They use 10 tracks, with 7 for training and 3 for testing. The test tracks are only used to evaluate the learned policy (called “primary” in the paper).

  • Using a histogram of squared errors in the data, they decide on $\tau = 0.0025$ as the threshold so that 20 percent of initial training samples are considered “unsafe.”

  • They report damage per lap as a way to measure policy safety, and argue that policies trained with SafeDAgger converge to a perfect, no-damage policy faster than vanilla DAgger. I’m having a hard time reading the plots, though — their “SafeDAgger-Safe” curve in Figure 2 appears to be perfect from the beginning.

  • Experiments also suggest that as the number of DAgger iterations increases, the proportion of time driven by the reference policy decreases.

Future Work? After reading the paper, I had some thoughts about future work directions:

  • First, SafeDAgger is a broadly applicable algorithm. It is not specific to driving, and it should be feasible to apply to other imitation learning problems.

  • Second, the cost is assumed to be the same for each data point, which is certainly not the case in real life. Consider context switching: one can request the reference for help at time steps 1, 3, 5, 7, and 9, or at time steps 3, 4, 5, 6, and 7. Both require the same raw number of queries, but it seems intuitive that, given a fixed budget of time, a reference policy would prefer a contiguous block of time steps.

  • Finally, one downside strictly from a scientific perspective is that there are no other baseline methods tested other than vanilla DAgger. I wonder if it would be feasible to compare SafeDAgger with an approach such as SHIV from ICRA 2016.

Conclusion

To recap: SafeDAgger follows the DAgger framework and attempts to reduce the number of queries to the reference/supervisor policy. SafeDAgger predicts the discrepancy between the learner and the supervisor; the states with high discrepancy are the ones that get queried (i.e., labeled) and used in training.

There’s been a significant amount of follow-up work on DAgger. If I am thinking about trying to reduce supervisor burden, then SafeDAgger is among the methods that come to my mind. Similar algorithms may get increasingly used in machine learning if DAgger-style methods become more pervasive in machine learning research, and in real life.










How I Made My IROS 2020 Conference Presentation Video

Sep 21, 2020


This is my official video presentation for IROS 2020.

The 2020 International Conference on Intelligent Robots and Systems (IROS) will be virtual. It was planned to be in Las Vegas, Nevada, from October 25-29. While this was unfortunately expected, I understand the need to reduce large gatherings, as the pandemic is still happening here. I wish our government, and private citizens, could look around the world and see where things are going right regarding COVID-19; for example, Taiwan is having 10,000 person concerts and has all of seven recorded deaths as of today, while the United States still has heavy restrictions on in-person gatherings with well over 200,000 deaths (here’s the source I’ve been checking to track this information).

For IROS 2020, I am presenting a paper on robot fabric manipulation, done in collaboration with wonderful colleagues from Berkeley and Honda Research Institute. IROS 2020 asked us to create a 15-minute video for each paper, and my final product is shown above and also available on my YouTube channel. This is by far the longest pre-recorded video I have ever made for a conference. I believe it’s also my first video with audio. Normally, my research videos are just a handful of minutes long, and if I need to clarify things in the video, I add text (subtitles) manually in the iMovie application. For my IROS video, however, I wanted to make a longer video with audio, but I also knew I needed a more scalable way to add subtitles, which would be necessary for me to completely understand the video if I were to re-watch it many years later. I also wanted the subtitles to be unavoidably visible, to encourage other researchers to add subtitles to their videos.

Here is the backstory of how I made this video.

First, as part of my research that turned into this paper, I had many short video clips of a robot manipulating fabric in iMovie on my MacBook Pro laptop. I started a fresh iMovie file, and picked the robot videos that I wanted to include.

Then, I created a new Google Slides and a new Google Doc. In the Google Slides file, I created the slides that I wanted to show in the final video. These slides were mostly copied and pasted from earlier, internal research presentations, and reformatted to a consistent font and size style.

In the Google Doc, I wrote down my entire transcript, which turned out to be slightly over four pages. I then practiced my audio by stating what I wrote on the transcript, peppered with my usual enthusiasm. I also tried to avoid talking too fast. I used the Voice Memos app on my iPhone to record audio. I made multiple audio files, each about one minute long. This made it simpler to redo any audio (which I had to do frequently), since I only had to redo small portions instead of the entire video’s audio.

Once I felt like the slides were ready, and that they aligned well with the audio, I put in each slide and audio file into iMovie, carefully adjusting the time ranges to align them, and to make sure the video did not exceed the 15-minute limit. I made further edits and improvements to the video after getting feedback from my colleagues. When I was sufficiently satisfied with the result, I saved and got an .mp4 video file.

But what about adding subtitles?

iMovie contains functionality for adding subtitles, but the process is manual and highly cumbersome. After some research, I found this video tutorial which demonstrates how to use Kapwing to add subtitles. Kapwing is entirely web-based, so there’s no need to download it locally – I can upload videos to their website and edit in a web browser.

I can add subtitles in Kapwing by uploading audio files, and Kapwing will use automatic speech recognition to generate an initial draft, which I then fine-tune. Here is the interface for adding subtitles:


I paid 20 USD for a monthly subscription so that I could create a longer video, and followed the tutorial mentioned earlier to add subtitles. Eventually, I got my 15-minute video, which just barely fit under the 50MB file limit as mandated by IROS. I uploaded it to the conference, as well as to YouTube, which is the one at the top of this post.

I am happy with the final video product. That said, the process of adding subtitles was not ideal:

  • The automatic speech recognition for producing an initial guess at the subtitles is … bad. I mean, really bad. I’d guess it got less than 5% of my audio correct, so in practice I added all of my subtitles by manually copying and pasting from my Google Doc. To put things in perspective, Google Meet (my go-to video conferencing tool these days) handles my audio far better, with subtitles that are remarkably high quality.

  • The interface for subtitles is also cumbersome to use, though to be fair, it’s an improvement over iMovie. As shown in the screenshot above, when re-editing a video, it doesn’t seem to preserve the ordering of the subtitles (notice how my first line in the video is listed second above). Furthermore, when editing and then clicking “Done”, I sometimes saw subtitles with incorrect sizes, so I had to re-edit the video … only to see a few subtitles disappear each time I did this. There also did not seem to be a way to change the subtitle size for all subtitles simultaneously. My solution was to forget about saving in progress, and to painstakingly go through each subtitle to change the size by manually clicking via a drop-down menu.

I hope this was useful! It is likely that future conferences will continue to be virtual in some way. For example, I am attempting to submit several papers to ICRA 2021, which will be in Xi’an, China, next summer. The website says ICRA 2021 will be a hybrid event with a mix of virtual and in-person events, but I would bet that many travel restrictions will still be in place, particularly for researchers from the United States. For that, and several other reasons, I am almost certainly going to be a virtual attendee, so I may need to revisit these instructions when making additional video recordings.

As always, thank you for reading, stay safe, and wear a mask.










The Virtual 2020 Robotics: Science and Systems Conference

Aug 23, 2020

I have attended eight international academic conferences that contain refereed paper proceedings. My situation is much different nowadays as compared to the middle of 2016, when I had attended zero academic conferences, thought my research career was going nowhere, and that I would leave Berkeley without a PhD.

The most recent conference I attended was also my first virtual one, the 2020 Robotics: Science and Systems, one of the world’s premier robotics conferences. It occurred last month, and in keeping up with the tradition of my blog, I will briefly discuss what happened while adding some thoughts about virtual conferences.

RSS 2020: The Workshop Days

RSS 2020 consisted of five days, with the first two dedicated to workshops. For the two workshop days, I checked the schedule in advance and decided on one workshop per day to attend, to avoid overextending my attention span and to keep my schedule manageable. For the first day, I attended the 2nd Workshop on Closing the Reality Gap in Sim2Real Transfer for Robotics. It was a fun one: much of it consisted of pre-recorded, two-on-two debates addressing controversial statements:

  • “Investing into Sim2Real is a waste of time and money”
  • “Sim2Real is old news. It’s just X (X=model-based RL, X=domain randomization, X=system identification)”
  • “Sim2Real requires highly accurate physical simulators and photorealistic rendering”

I am a huge fan of Sim2Real, and several of my papers use the technique, so I was especially galled by the first claim. Surely it can’t possibly be a waste of time and money? (Full disclosure: one of my PhD advisors agrees with me and was in the debate arguing against the first claim, but I still would hold that belief even if he was not involved. You’ll have to take my word on that.) Going through the debate was enjoyable – despite my opposition to the statement, I appreciated the perspectives of the two CMU professors, Abhinav Gupta and Chris Atkeson, arguing in favor of the claim. While researching the academic publications of those professors, I found Chris Atkeson’s impressive and persuasive 100-page paper providing his advice for starting graduate students, which features some Sim2Real discussion.

Rather than try to further describe my messy notes, I will refer you to my former colleague (and now CMU PhD student) Jacky Liang, who wrote a nice summary of the workshop. I am still going to be doing some Sim2Real work in the near future, and particularly nowadays due to the pandemic limiting access to physical robots, an obvious point that was somehow only articulated at the end of the workshop, by Berkeley Professor Anca Dragan.

For the next day, I attended the workshop on Self-Supervised Robot Learning. This was a four-hour workshop, and one that was more traditional in the sense that it was a series of longer talks by professors and research scientists, with shorter “lightning talks” by authors of accepted workshop papers. I chose to attend this because I am very interested in the topic, and think getting automatic supervision without tedious, manual labeling is key for scaling up robots to the real world. Here’s a relevant blog post of mine if you would like to read more.

There were seven speakers, and I have personally spoken to six of them (all except Abhinav Gupta):

  • Dieter Fox: discussed KinectFusion and self-correspondences with descriptors. I knew some of this material from reading the relevant papers, and (you guessed it) I have a relevant blog post.
  • Abhinav Gupta: talked about much of newly-appointed Professor Lerrel Pinto’s work in scaling up robot learning. I have read almost all of Lerrel Pinto’s early papers and was pleased to see them resurface here.
  • Pierre Sermanet: discussed his “learning from play” papers which involve planning and learning from language. It’s fascinating stuff, and I have his papers on my “to read” list.
  • Roberto Calandra: provided a series of “lessons learned” in doing robot learning research, and commented about how COVID-19 might mandate more self-supervised robots that can run on their own.
  • Chelsea Finn: presented a chronology over the last 5 years about how we acquire data for robot learning, and how we can make this scale up. Critically, we need to broaden the training data distribution to cover more test-time scenarios.
  • Pieter Abbeel: presented the CURL and RAD papers which suggest that learning from pixels can be as efficient as learning from state. I have read the papers in some detail, and helped with formatting the recent BAIR blog post about CURL and RAD.
  • Andy Zeng: provided his thoughts on the “object-ness” assumption in robot learning, and how he was able to get automatic labels for his papers. I described some of his great work in this blog post. I am also very fortunate to have him as my Google summer internship host!

It was a great workshop with great speakers.

RSS 2020: The Conference Days

The next three days were the formal conference days. In general, the schedule was similar for each day. Each had live talks and two hours of live poster sessions of accepted papers, all happening over Zoom. RSS is still a relatively small robotics conference, with only 103 accepted papers in 2020, in contrast to ICRA and IROS which now have well over 1000 accepted papers each year. This meant that RSS was “single track,” so only one thing was formally happening at once.

After the opening talks to introduce us to virtual RSS, we had the first of the two-hour paper discussion sections. I stayed primarily in the Zoom room allocated to my paper at RSS 2020.

University of Washington Professor Byron Boots gave an “early career” talk in the afternoon, featuring online learning and regret analysis, which befits his publication list. Some of his work involves analyzing Model Predictive Control (MPC), and once again I felt relieved about my RSS paper, which used MPC. Working on that project has made it so much easier for me to understand MPC and related topics.

The second day began with a Diversity and Inclusion panel, featuring people such as Michigan Professor Chad Jenkins. I watched the discussion and thought it went well. We then had the usual two hours of paper discussions. Most Zoom rooms were almost empty, with the exception of the paper authors. Honestly, I like this, because it made it easy for me to talk with various paper authors.

The keynote talk by MIT Professor Josh Tenenbaum later that day was excellent. He’s done great work in areas that overlap with robotics, most notably computer vision and psychology, and I was thinking about how I could incorporate his findings into my research agenda.

The third day of the conference began with a discussion and a town hall. Many conferences have started these discussions, which I suspect is in large part to solicit feedback on how to make conferences more inclusive to the research community. I recall that a conference organizer mentioned that we have professional real-time captioning for all the major talks, and praised it. I agree! There was some Q & A at the end, and one thought-provoking comment came from an audience member who thought that hybrid conferences that combine virtual and in-person events would not work well. While the commenter made it clear that he/she wanted to see a hybrid event work, there is a huge risk in creating an inequity to favor people who are there in-person over those who are attending virtually. It will be interesting to see what happens with ICRA 2021, which is planned to be in Xi’an, China, next May. The ICRA 2021 website is already saying that the conference will be hybrid.

After this, we had the usual paper discussions, followed by Stanford Professor Jeannette Bohg’s excellent early career talk. The day concluded with the paper awards and the farewell talk. First, congratulations to Google for winning several awards! Second, the conference organizers said they could not provide any definitive information about where RSS would be held next year.

RSS 2020: Thoughts

I read through various blog posts before attending RSS, such as one from Berkeley Professor Ben Recht and one from the organizers of ICLR 2020, which was one of the first conferences in 2020 that was forced to go virtual, so I had a rough sense of what to expect from a virtual conference. As usual, though, there’s no substitute for going through the process in person (I’m not sure if that should be a pun or not). Here are some brief thoughts:

  • The conference had some virtual rooms, which I think are called “gather” sessions, for informal chats. Unfortunately, almost every time I logged into these rooms, I was the only one there. Did people make heavy use of these? On a related note, there were a few slack channels for the workshops, but I think hardly anyone used them. Maybe Slack channels should be deprioritized for smaller virtual conferences?
  • Since it looks like many conferences will be virtual or hybrid going forward, perhaps we should get rid of the requirement that at least one author of each accepted paper has to physically attend the conference. Given the COVID-19 situation, and also geopolitical issues pertaining to visas and immigration, it seems like people ought to have the option to avoid travel.
  • Getting a smaller conference to be time-zone friendly is a huge challenge. With a larger one like ICLR, it’s possible to have the conference run 24/7 with something happening for each time zone, but I don’t know of a good solution for one the size of RSS.
  • I didn’t have the ability to set aside my entire week for the conference, since I was still interning at Google while this was happening, though I suppose I could have asked for a few days off. This meant I worked more on research than I usually do during conferences. I’m not sure if that’s a good thing or a bad thing.
  • While I didn’t ask questions during the talks, I think a virtual setting makes it easier for many of us to ask questions. In a physical conference, we might have to walk to a microphone in an auditorium of thousands of people.
  • As mentioned earlier, I like the smaller Zoom sessions that replace physical poster sessions. It was far easier for me to engage in substantive conversations with other researchers. In contrast, when I was at NeurIPS 2019, I could barely talk to any author given the size and the crowded, elbow-to-elbow poster sessions.
  • I thought it was easier to get an academic accommodation; I requested professional captioning. For a virtual conference, it isn’t necessary to pay for someone (e.g., a sign language interpreter) to physically travel, which can increase costs.

To conclude this blog post, I want to thank the RSS organizers. I know things aren’t quite ideal, but virtual RSS went well, and I hope to attend in 2021, whether in person or virtual.










On Anti-Racism

Jul 25, 2020

The last few months have taught us a lot about America. Our country is facing the twin crises of COVID-19 and racism. While the former is novel, and the current crisis is in part (actually, largely) due to a lack of leadership by our top political officials, the latter is perhaps the oldest problem that stubbornly never disappears. In this post, I discuss, in order: policing (including one benign encounter with a police officer), anti-racism in academia and AI, and my own anti-racist education: what I am reading, where I am donating, and what I commit to doing in the near future.

Policing. I was as appalled as many others from watching videos of police treatment of African-Americans in this country, especially the George Floyd case, and I share the concerns many have over police conduct against Blacks. On the other hand, I also believe there has to be some police presence, or law enforcement more broadly. The 1969 Murray-Hill riot in Montreal, for example, where a strike by Montreal police led to widespread lawless activity, demonstrates how much society depends on law enforcement, and makes me worry that the absence of a police presence can lead to anarchy.

Growing up, I was told to be extra cautious around the police, and to make my hearing disability clear upfront to any police officer to avoid misunderstandings. There have been tragic cases of deaf people being harmed and even killed by law enforcement officers who presumed a deaf person could hear and was ignoring or disobeying their commands. I know my situation is not the same as, and is far milder than, what many Blacks experience. While I am not white, people often assume I am from my physical appearance, so my race has not been an issue in these encounters.

In my life, I have been stopped by the police a grand total of zero times. Well, except for (arguably) one case where Berkeley was running a random “sobriety test” on a Friday night, and police officers were stopping every car on the street that led to my apartment. That night I wasn’t driving home from a party; I was working in the robotics lab until 9:00pm.

When it was my turn, my conversation with the police officer went like this:

Me: Hello. Nice to meet you. Just to let you know I’m deaf and may not fully understand everything you say. But I’m happy to answer any questions you have. I am curious about what is happening here.

Police officer [smiling]: Gotcha. This is a random test that we’re having to check all drivers here. In any case I don’t smell any alcohol on you, so you’re free to go.

That’s it! I have otherwise never spoken to a police officer in any driving-related context, and my few other interactions with police officers have been similarly uneventful and non-threatening. When people such as United States Senator Tim Scott of South Carolina get stopped by the police at the Senate, as he describes in this interview, I wonder how our society can fix this.

That said, I also want to take a data-driven approach to let sober facts dictate my beliefs, rather than emotions or one-time events. Dramatic videos only show a small fraction of all police activity. Given the authority, trust, and power we give to police officers, however, the bar for their code-of-conduct should be high.

To summarize, I don’t think we should get rid of the police. I do believe we need to continue and improve training of police officers, the majority of whom do not have a college education, and to provide support (and better pay) to the good police officers while firing the bad ones. It may also be helpful if we can collectively reduce the need for police officers to deal with non-critical cases such as parking tickets and jaywalking so that they can prioritize the truly dangerous criminals. I can’t claim to be an expert on policing, so I will continue learning as much as I can about this area.

Anti-Racism in Academia and Artificial Intelligence. The Berkeley EECS department, like many similar ones in the country, is heavily dominated by Whites and Asians and has very few Blacks, so discussions of race and racism (at least in my conversations) tend to involve the White/Asian dynamic, with limited commentary about other groups.

The good news is that there’s been recent discussion about how to be anti-racist, with increased focus on Blacks. There was an email sent out by the chairs of the department which linked to statements by much of the faculty affirming their support for anti-racism. Several department-wide reading groups, email lists, and committees now exist for supporting anti-racism. A PhD student in the department, Devin Guillory, has a manuscript on combating anti-Blackness with a specific focus on the Artificial Intelligence community.

I think it’s important for the AI community to discuss the broader impacts of how our technologies can be used both for good and for bad, particularly when they can exacerbate existing disparities. One recent technology that is worth discussing is facial recognition. While I don’t do research in this area, my robotics research often uses technologies based on Deep Convolutional Neural Networks that form the bedrock for facial recognition.

Rarely is it easy to admit that one is wrong, but I think I was wrong about my initial stance on facial recognition. When I first learned about the capabilities of Convolutional Neural Networks from CS 280 at Berkeley and then the associated facial recognition literature, I dreamed of society deploying the technology to detect and catch criminals with surgical precision. (I don’t have an earlier blog post or other writing about this, so you’ll have to take my word on it.)

Since then, I’ve done almost a complete reversal and now think we should limit facial recognition research and technology, at least until we can come up with solutions that explicitly consider minority interests. Here’s why:

  • I share concerns over potential inaccuracies in the technology when it pertains to racial minorities. For example, a landmark 2018 paper by Joy Buolamwini and Timnit Gebru showed that facial recognition technologies (at least at the time of publication) were far more inaccurate on people with darker skin. While the technology may have gotten more accurate on people with darker skin since it was published, a recent news article about a wrongful arrest of a black man due to facial recognition makes me anxious.

  • I also worry about facial recognition being used to limit and control personal freedom. I see the extreme case of facial recognition technology in China, where, particularly in Xinjiang, the government runs an extensive surveillance system over the Uighur Muslims. Comparisons across different countries and governance systems are imperfect, but I hope the United States never reaches this level of surveillance, and the situation there should serve as a warning for American residents to be wary of facial recognition systems in our own communities.

When the ACM made the following tweet a few months ago, I was heartened to see pushback by many members of the computer science community. I hope this causes the community to carefully consider the development of facial recognition technologies.


Left: a tweet the ACM sent out regarding facial recognition. (I believe this is the tweet; it's hard to find because they have deleted it.) Right: the ACM's apology.

Anti-Racism More Broadly. As mentioned earlier, as part of my broader anti-racism education, I am pursuing three separate activities which can be categorized as reading books, donating to organizations, and making commitments about my actions now and in the future.

First, in terms of books, I have been reading these in recent months:

  • Evicted: Poverty and Profit in the American City by Matthew Desmond (published 2016)
  • White Fragility: Why It’s So Hard for White People to Talk About Racism by Robin DiAngelo (published 2018)
  • So You Want to Talk About Race? by Ijeoma Oluo (published 2018)
  • Me and White Supremacy: Combat Racism, Change the World, and Become a Good Ancestor by Layla F. Saad (published 2020)
  • Stamped from the Beginning: The Definitive History of Racist Ideas in America by Ibram X. Kendi (published 2016)

I finished the first four books above and recommend all of them. I am currently working through Ibram X. Kendi’s book. I enjoy reading these books — not, of course, in the sense that racism is “enjoyable,” but because they are well-written, well-argued books that teach me a great deal.

In addition, I commit to increasing the number of books I read about Blacks or by Black authors. Given that I post my reading list online (see the blog archives), it should be easy to hold me accountable.

Second, I have learned more about, and have donated to, these organizations:

All relate to tech: the first for young Black women, the second for Black researchers in AI, the third for Black and Latinx in tech, the fourth for under-represented minorities more broadly, and the fifth for low-income youth. There are other loosely related organizations that I support and have donated to in the past, but I think the above are the most relevant for the current blog post context.

Third, going forward, I commit to anti-racism. I will not shy away from discussing this topic. I will actively help with the recruitment and retention of Blacks within my work environment. I will avoid comments that show insensitivity in race-related contexts, including but not limited to: “playing the race card,” “I don’t see color,” “All Lives Matter,” “I am not White,” or “I have Black friends.” I also will not claim that my research is entirely disjoint from race. My robotics research is less directly related to race than facial recognition research, but that is different from saying it has nothing to do with race.

I will be careful to consider a variety of perspectives when forming my own opinions on related events. It may be that I believe something most of my nearby colleagues disagree with. We don’t have to agree on everything, but I would like the academic community to avoid cases similar to how US Senator Dick Durbin smeared fellow US Senator Tim Scott (Durbin has since apologized), and more generally to avoid treating Blacks as a monolithic group.

Concluding Remarks. While this blog post is coming to a close, being an anti-racist is a lifelong process. I am never going to claim perfection, or that I have passed some “anti-racist threshold” and am therefore one of the “good guys.” I will make many mistakes along the way. I may discuss more about this in future blog posts. In the meantime, let me know if you have comments or suggestions.










Regarding the ICE International Student Ban

Jul 19, 2020

I was going to email this letter to elected federal politicians, but fortunately, the U.S. Immigration and Customs Enforcement (ICE) seems to have repealed their misguided policy about forcing international students to take classes in-person. Nonetheless, here’s the letter, and in case a similar policy somehow re-emerges, I will start sending this message. This particular letter is addressed to U.S. Senator Dianne Feinstein given my California residency, her particular Senate Committee assignments, and because of the six offices I called last week, only hers had an actual human on the line for me to address my concerns. The letter is based on this template. Unfortunately I’m not sure who wrote it.

Dear Senator Dianne Feinstein,

My name is Daniel Seita. I currently reside in the San Francisco Bay Area. I am a registered voter, and I thank you for your many years of service in the United States Senate representing California.

I am emailing to insist that you stop the recent student ban.

On July 6, 2020, the United States Immigration and Customs Enforcement announced that they will be modifying their Student and Exchange Visitor Program (SEVP) impacting F-1 and M-1 international students. Under the modified SEVP, F-1 and M-1 students with valid student visas would be forced to leave the United States if their college or university was not offering in-person classes.

International students pay the highest tuition to colleges and universities, and shifting to an online-only curriculum does not reduce that economic burden. Forcing international students to pay these high costs while also making them leave the country is unfair on many levels. Furthermore, the funds that international students bring in subsidize domestic students.

With the COVID-19 pandemic still spiking, opening in-person classes is unsafe and unnecessary. These new SEVP modifications force universities to choose between opening in-person classes even when it is not safe, or losing their international student body, who account for billions of dollars to the US economy.

International students have built lives for themselves while at school, and it is cruel to take it away. Students have signed leases and agreements, have possessions and belongings, and have loved ones and friends that they are being ripped apart from because of the unpredictable consequences of COVID-19. Many domestic students are unable to take classes in-person and it is an unfair expectation that international students who are here, legally, for school must be able to enroll in on-campus courses in order to stay in the country. With the fall semester rapidly approaching there is little time for students to transfer schools or find somewhere else to live.

The US has many of the best universities in the world, and a large part of that is due to immigration and international students. Our country has an unparalleled ability to recruit the best and brightest from all over the world, many of whom choose to stay in the country after their education. Without the contributions of international students and faculty, the quality of our education, research, and innovation would plummet.

I am a computer science PhD student at the University of California, Berkeley, and I work in artificial intelligence and robotics. I would guess that one-third of the people who I regularly collaborate with in my research are internationals. They have taught me so much about my field and have helped to raise my quality of research. Severing these collaborations will not only disrupt our research, but damage America’s global reputation.

I hope you consider these concerns and convince ICE to overturn the student ban.

Daniel Seita

Thanks to every international student and collaborator who teaches and inspires me.










When Deep Models for Visual Foresight Don't Need to be Deep

Jul 3, 2020

The virtual Robotics: Science and Systems (RSS) conference will happen in about a week, and I will be presenting a paper there. This is going to be my first time at RSS, and I was hoping to go to Oregon State University and meet other researchers in person, but alas, given the rapid disintegration of America as it pertains to COVID-19, a virtual meeting makes 100 percent sense. For RSS 2020, I’ll be presenting our paper VisuoSpatial Foresight for Multi-Step, Multi-Task Fabric Manipulation, co-authored with Master’s (and soon to be PhD!) student Ryan Hoque. This is based on a technique called visual foresight, and in this blog post, I’d like to briefly touch upon the technique, and then discuss a little more about our RSS 2020 paper, along with another surprising paper which shows that perhaps we need to rethink our deep models.

First, to make sure we’re on common ground, what do we mean by “Visual Foresight”? This refers to the technique described in an ICRA 2017 paper by Chelsea Finn and Sergey Levine, which was later expanded upon in a longer journal paper with lead authors Chelsea Finn and Frederik Ebert. The authors are (or were) at UC Berkeley, my home institution, which is one reason why I learned about the technique.

Visual Foresight is typically used in a model-based RL framework. I personally categorize model-based methods into whether the models predict images or whether they predict some latent variables (assuming, of course, that the model itself needs to be learned). Visual Foresight applies to the former case for predicting images. In practice, given the difficult nature of image prediction, this is often done by predicting translations or deltas between images. For the second case of latent variable prediction, I refer you to the impressive PlaNet research from Google.

For another perspective on model-based methods, the following text is included in OpenAI’s “Spinning Up” guide for deep reinforcement learning:

Algorithms which use a model are called model-based methods, and those that don’t are called model-free. While model-free methods forego the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune. As of the time of writing this introduction (September 2018), model-free methods are more popular and have been more extensively developed and tested than model-based methods.

and later:

Unlike model-free RL, there aren’t a small number of easy-to-define clusters of methods for model-based RL: there are many orthogonal ways of using models. We’ll give a few examples, but the list is far from exhaustive. In each case, the model may either be given or learned.

I am writing this in July 2020, and I believe that since September 2018, model-based methods have made enormous strides, to the point where I’m thinking that 2018-2020 might be known as the “model-based reinforcement learning” era. Also, to comment on a point from OpenAI’s text, while model-free methods might be easier to implement in theory, I argue that model-based methods can be far easier to debug, because we can check the predictions of the learned model. In fact, that’s one of the reasons why we took the model-based RL route in our RSS paper.

Anyway, in our RSS paper, we focused on the problem of deformable fabric manipulation. In particular, given a goal image of a fabric in any configuration, can we train a pick-and-place action policy that will manipulate the fabric from an arbitrary starting configuration to the goal configuration? For Visual Foresight, we trained a deep recurrent neural network model that could predict full 56x56 resolution images of fabric. We predicted depth images in addition to color images, making the model “VisuoSpatial.” Specifically, we used Stochastic Variational Video Prediction (SV2P) as our model. The wording “Stochastic Variational” means the model samples a latent variable before generating images, and the stochastic nature of that variable means the model is not deterministic. This is an important design aspect; see the SV2P paper for further details. But, as you might imagine, this is a very deep, recurrent, and complex model. Is all this complexity needed?

Perhaps not! In a paper at the Workshop on Algorithmic Foundations of Robotics (WAFR) this year, Terry Suh and Russ Tedrake of MIT show that, in fact, linear models can be effective in Visual Foresight.

Wait, really?

Let’s dive into that work in more detail, and see how it contrasts to our paper. I believe there are great insights to be gained from reading the WAFR paper.

In this paper, Terry Suh and Russ Tedrake focus on the task of pushing small objects into a target zone, such as pushing diced onions or carrots the way a human chef might. Their goal is to train a pushing policy that can learn and act based on greyscale images. They make an argument similar to the one in our RSS 2020 paper about the difficulty of knowing the “underlying physical state.” For us, “state” means the vertices of the cloth. For them, “state” means the poses of all the objects. Since that is hard to obtain with many small objects piled on top of each other, learning from images is likely easier.

The actions are 4D vectors $\mathbf{u}$ consisting of (a) the 2D starting coordinates, (b) the scalar push orientation, and (c) the scalar push length. They use Pymunk for simulation, which I had never heard of before. That’s odd; why not use PyBullet, which might be more standard for robotics? I have been able to simulate this kind of environment in PyBullet.

That having been said, let’s consider first how (a) they determine actions, and (b) their visual foresight video prediction model.

Section 2.2 describes how they pick actions (for all methods they benchmark). Unlike us, they do not use the Cross Entropy Method (CEM) — there is no action sampling plus distribution refitting as happens in the CEM. The reason is that they can define a Lyapunov function which accurately characterizes performance on their task, and furthermore, they can minimize for it to get a desired action. The Lyapunov function $V$ is defined as:

\[V(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{p_i \in \mathcal{X}} \min_{p_j \in \mathcal{S}_d} \|p_i - p_j\|_{p}\]

where \(\mathcal{X} = \{p_i\}\) is the set of all 2D particle positions, and $\mathcal{S}_d$ is the desired target set for the particles. The notation \(\| \cdot \|_p\) simply refers to a distance metric in the $p$-norm.
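As a concrete sketch, here is one way to compute this Lyapunov function in Python over a discretized target set. The function name and the toy points are my own illustration, not from the paper:

```python
import numpy as np

def lyapunov(particles, target, p=2):
    # V(X): average over particles of the minimum p-norm distance
    # to any point in the (discretized) target set S_d.
    particles = np.asarray(particles, dtype=float)
    target = np.asarray(target, dtype=float)
    dists = np.linalg.norm(
        particles[:, None, :] - target[None, :, :], ord=p, axis=-1)
    return dists.min(axis=1).mean()

# Toy example: particles already inside the target give V = 0.
target = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(lyapunov(target, target))        # 0.0
print(lyapunov([[2.0, 0.0]], target))  # 1.0: nearest target point is (1, 0)
```

Since V is zero exactly when every particle sits in the target set and decreases as particles approach it, minimizing it over candidate actions gives a natural controller.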


The figure above, from the paper, visualizes the Lyapunov function. It is interpreted as a distance between a discrete set of points and a continuous target set; the pentagon at the center indicates the target set. In their instantiation of the Lyapunov function, if all non-zero pixels (non-zero means carrots, due to height thresholding) in the image of the scene coincide with the pentagon, then the element-wise product of the two images is 0 everywhere, and summing it results in 0.

The paper makes the assumption that:

for every image that is not in the target set, we can always find a small particle to push towards the target set and decrease the value of the Lyapunov function.

I agree. While there are cases where pushing particles inwards might result in higher values (i.e., worse performance) by pushing particles that are already inside the zone out of it, I think it is always possible to find some movement that gets a greater number of particles into the target. If anyone has a counter-example, feel free to share. This assumption may hold more readily for convex target sets, but I don’t think the authors assume convexity, since they later test on targets shaped “M,” “I,” and “T.”

Overall, the controller appears to be accurate enough so that the prediction model performance is the main bottleneck. So which is better: deep or switched-linear? Let’s now turn to that, along with the “visual foresight” aspect of the paper.

Their linear model is “switched-linear”. This is an image-to-image mapping based on a linear map characterized by

\[y_{k+1} = \mathbf{A}_i y_k\]

for \(i \in \{1, 2, \ldots, |\mathcal{U}|\}\), where $\mathcal{U}$ is the discretized action space and $y_k \in \mathbb{R}^{N^2}$ represents the flattened $N \times N$ image at time $k$. Furthermore, $\mathbf{A}_i \in \mathbb{R}^{N^2 \times N^2}$. This is a huge matrix, and there are as many of these matrices as there are actions! This appears to require a lot of storage.

My first question after reading this was: when they train the model using pairs of current and successor images $(y_{k}, y_{k+1})$, is it possible to train all the $\mathbf{A}_i$ matrices?

Or are we restricted to only the matrix corresponding to the action that transformed $y_k$ into $y_{k+1}$? If the latter were true, it would be a serious limitation. I breathed a sigh of relief when the authors clarified that they can reuse training samples, up to the push length. They discretized the push length into 5 values, and then collected 1000 data points (image pairs) for each, for 5000 total. They then find the optimal matrices (and actions, since matrices correspond to actions here) via ordinary least squares.
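To make the least-squares fit concrete, here is a small sketch of fitting one $\mathbf{A}_i$ per discretized action from image pairs. The helper name and the tiny 4-pixel “images” are my own illustration, not the paper’s setup:

```python
import numpy as np

def fit_switched_linear(pairs_per_action):
    # For each discretized action i, solve Y @ A_i.T ~= Y_next by
    # ordinary least squares, where rows of Y are flattened images y_k.
    models = []
    for Y, Y_next in pairs_per_action:
        A_T, *_ = np.linalg.lstsq(Y, Y_next, rcond=None)
        models.append(A_T.T)
    return models

# Sanity check: recover a known linear map on tiny 4-pixel "images".
rng = np.random.default_rng(0)
A_true = rng.normal(size=(4, 4))
Y = rng.normal(size=(50, 4))
models = fit_switched_linear([(Y, Y @ A_true.T)])
print(np.allclose(models[0], A_true))  # True
```

Note that at realistic resolutions each $\mathbf{A}_i$ is $N^2 \times N^2$, which is why storage becomes a concern as images grow.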

Their deep models are referred to as DVF-Affine and DVF-Original. The affine one is designed for a fairer comparison with the linear model, so it’s an image-to-image prediction model with five separate neural networks, one for each of the discretized push lengths. DVF-Original takes the action as an additional input, while DVF-Affine does not.

Surprisingly, their results show that their linear model has lower prediction error on a held-out set of 1000 test images. This should directly translate to better performance on the actual task, since more accurate models mean the Lyapunov function will be driven down to 0 faster. Indeed, their results confirm the prediction error results, in the sense that linear models are the best or among the best in terms of task performance.

Now we get to the big question: why are linear models better than deeper ones for these experiments? I thought of these while reading the paper:

  • The carrots are very tiny in the images, so perhaps the 32x32 resolution makes it hard to accurately capture the fine-grained nature of the carrots.

  • The images are grayscale and small, which means linear models may work better as opposed to if the images were larger. At some point the “$N$” in their paper will grow too large to be used with linear models. (Of course with larger images, the problem of video prediction becomes exponentially harder. Heck, we only used 56x56 in our paper, and the SV2P paper used 64x64 images.)

  • Perhaps there’s just not enough data? It looks like the experiments use 23,000 data points to train DVF-Original, and 5,000 data points for DVF-Affine? For a point of comparison, we used about 105,000 images of cloth.

  • Furthermore, the neural networks are trained directly on the pixels in an end-to-end manner using the Frobenius norm loss (basically mean square error on pixels). In contrast, models such as SV2P are trained using Variational AutoEncoder style losses, which may be more powerful. In addition, the SV2P paper explicitly stated that they performed a multi-stage training procedure since a single end-to-end procedure tends to converge to less than ideal solutions.

  • Perhaps the problem has a linear nature to it? While reading the paper, I was reminded of the thought-provoking NeurIPS 2018 paper on how simple random search on linear models is competitive for reinforcement learning on MuJoCo environments.

  • Judging from Figure 11, the performance of the better neural network model seems almost as good as the linear one. Maybe the task is too easy?

Eventually, the authors discuss their explanation: they believe that their problem has natural linearity in it. In other words, there is inductive bias in the problem. Inductive bias in machine learning is a fancy way of saying that different machine learning models make different assumptions about the prediction problem.

Overall, the WAFR 2020 paper is effective and thought-provoking. It makes me wonder if we should have at least tried a linear model that could perhaps predict edges or corners of cloth while trying to abstract away other details. I doubt it would work for complex fabric manipulation tasks, but perhaps for simpler ones. Hopefully someone will explore this in the future!


Here are the papers discussed in this post, ordered by publication date. I focused mostly on the WAFR 2020 paper; the others are my paper with Ryan for RSS, the two main Visual Foresight papers, and the SV2P paper that introduced the video prediction model we used in our paper.










Offline (Batch) Reinforcement Learning: A Review of Literature and Applications

Jun 28, 2020

Reinforcement learning is a promising technique for learning how to perform tasks through trial and error, with an appropriate balance of exploration and exploitation. Offline Reinforcement Learning, also known as Batch Reinforcement Learning, is a variant of reinforcement learning that requires the agent to learn from a fixed batch of data without exploration. In other words, how does one maximally exploit a static dataset? The research community has grown interested in this in part because larger datasets are available that might be used to train policies for physical robots. Exploration with a physical robot may risk damage to robot hardware or surrounding objects. In addition, since offline reinforcement learning disentangles exploration from exploitation, it can help provide standardized comparisons of the exploitation capability of reinforcement learning algorithms.

Offline reinforcement learning, henceforth Offline RL, is closely related to imitation learning (IL) in that the latter also learns from a fixed dataset without exploration. However, there are several key differences.

  • Offline RL algorithms (so far) have been built on top of standard off-policy Deep Reinforcement Learning (Deep RL) algorithms, which tend to optimize some form of a Bellman equation or TD difference error.

  • Most IL problems assume an optimal, or at least a high-performing, demonstrator which provides data, whereas Offline RL may have to handle highly suboptimal data.

  • Most IL problems do not have a reward function. Offline RL considers rewards, which furthermore can be processed after-the-fact and modified.

  • Some IL problems require the data to be labeled as expert versus non-expert. Offline RL does not make this assumption.

I preface the IL descriptions with “some” and “most” because there are exceptions to every case and that the line between methods is not firm, as I emphasized in a blog post about combining IL and RL.

Offline RL is therefore about deriving the best policy possible given the data. This gives us the hope of out-performing the demonstration data, which is still often a difficult problem for imitation learning. To be clear, in tabular settings with infinite state visitation, it can be shown that algorithms such as Q-learning converge to an optimal policy despite potentially sub-optimal off-policy data. However, as some of the following papers show, even “off-policy” Deep RL algorithms such as the Deep Q-Network (DQN) algorithm require substantial amounts of “on-policy” data from the current behavioral policy in order to learn effectively, or else they risk performance collapse.
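To make the setting concrete, here is a minimal tabular sketch of offline Q-learning: the standard update swept repeatedly over a fixed batch of transitions, with no new data ever collected. The toy three-state chain MDP is my own illustration:

```python
import numpy as np

# A fixed batch of (s, a, r, s_next, done) transitions from a toy
# 3-state chain; in the offline setting this is all the agent sees.
batch = [
    (0, 1, 0.0, 1, False), (1, 1, 0.0, 2, False), (2, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False), (1, 0, 0.0, 0, False),
]

gamma, alpha = 0.9, 0.5
Q = np.zeros((3, 2))  # 3 states, 2 actions

# Standard Q-learning update, applied only to the static batch.
for _ in range(200):
    for s, a, r, s_next, done in batch:
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])

print(int(Q[0].argmax()))  # 1: the greedy policy at state 0 heads toward the reward
```

In this tiny tabular case the batch covers every state-action pair, so the update converges to the optimal policy; the trouble discussed below arises when function approximation must evaluate pairs the batch never contains.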

For a further introduction to Offline RL, I refer you to (Lange et al, 2012). It provides an overview of the problem, and presents Fitted Q Iteration (Ernst et al., 2005) as the “Q-Learning of Offline RL” along with a taxonomy of several other algorithms. While useful, (Lange et al., 2012) is mostly a pre-deep reinforcement learning reference which only discusses up to Neural Fitted Q-Iteration and their proposed variant, Deep Fitted Q-Iteration. The current popularity of deep learning means, to the surprise of no one, that recent Offline RL papers learn policies parameterized by deeper neural networks and are applied to harder environments. Also, perhaps unsurprisingly, at least one of the authors of (Lange et al., 2012), Martin Riedmiller, is now at DeepMind and appears to be working on … Offline RL.

In the rest of this post, I will summarize my view of the Offline RL literature. From my perspective, it can be roughly split into two categories:

  • those which try and constrain the reinforcement learning to consider actions or state-action pairs that are likely to appear in the data.

  • those which focus on the dataset, either by maximizing the data diversity or size while using strong off-policy (but not specialized to the offline setting) algorithms, or which propose new benchmark environments.

I will review the first category, followed by the second category, then end with a summary of my thoughts along with links to relevant papers.

As of May 2020, there is a recent survey from Professor Sergey Levine of UC Berkeley, whose group has done significant work in Offline RL. I began drafting this post well before the survey was released but engaged in my bad “leave the draft alone for weeks” habit. Professor Levine chooses a different set of categories, as his papers cover a wider range of topics, so hopefully this post provides an alternative yet useful perspective.

Off-Policy Deep Reinforcement Learning Without Exploration

(Fujimoto et al., 2019) was my introduction to Offline RL. I have a more extensive blog post which dissects the paper, so I’ll do my best to be concise in this post. The main takeaway is showing that most “off-policy algorithms” in deep RL will fail when solely shown off-policy data due to extrapolation error, where state-action pairs $(s,a)$ outside the data batch can have arbitrarily inaccurate values, which adversely affects algorithms that rely on propagating those values. In the online setting, exploration would be able to correct for such values because one can get ground-truth rewards, but the offline case lacks that luxury.

The proposed algorithm is Batch Constrained deep Q-learning (BCQ). The idea is to run normal Q-learning, but in the maximization step (which is normally $\max_{a’} Q(s’,a’)$), instead of considering the max over all possible actions, we want to only consider actions $a’$ such that $(s’,a’)$ actually appeared in the batch of data. Or, in more realistic cases, eliminate actions which are unlikely to be selected by the behavior policy $\pi_b$ (the policy that generated the static data).

BCQ trains a generative model — a Variational AutoEncoder — to generate actions that are likely to be from the batch, and a perturbation model which further perturbs the action. At test-time rollouts, they sample $N$ actions via the generator, perturb each, and pick the action with highest estimated Q-value.
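To make this concrete, here is a minimal numpy sketch of BCQ's test-time action selection. The network stand-ins (`vae_sample`, `perturb`, `q_value`) are toy placeholders of my own invention, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the learned networks (illustrative names, not from the paper's code):
# - vae_sample(s, n): generative model proposing n actions likely under the batch
# - perturb(s, a):    perturbation model adding a small bounded correction
# - q_value(s, a):    critic estimating Q(s, a)
def vae_sample(state, n):
    return rng.normal(loc=0.0, scale=0.5, size=(n, 2))       # n candidate 2-D actions

def perturb(state, actions, phi=0.05):
    return actions + np.clip(rng.normal(size=actions.shape), -phi, phi)

def q_value(state, actions):
    return -np.sum((actions - 0.3) ** 2, axis=1)             # toy critic peaked at a = 0.3

def bcq_select_action(state, n=10):
    """BCQ test-time selection: sample N actions, perturb each, take the argmax-Q one."""
    candidates = perturb(state, vae_sample(state, n))
    return candidates[np.argmax(q_value(state, candidates))]

action = bcq_select_action(state=np.zeros(3))
print(action.shape)  # prints (2,)
```

The key point is that the argmax only ranges over candidates the generative model considers plausible under the batch, never over the full action space.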

They design experiments as follows, where in all cases there is a behavioral DDPG agent which generates the batch of data for Offline RL:

  • Final Buffer: train the behavioral agent for 1 million steps with high exploration, and pool all the logged data into a replay buffer. Train a new DDPG agent from scratch, only on that replay buffer with no exploration. Since the behavioral agent will have been learning along those 1 million steps, there should be high “state coverage.”

  • Concurrent: as the behavioral agent learns, train a new DDPG agent concurrently (hence the name) on the behavioral DDPG replay buffer data. Again, there is no exploration for the new DDPG agent. The two agents should have identical replay buffers throughout learning.

  • Imitation Learning: train the behavioral agent until it is sufficiently good, then run it for 1 million steps (potentially with more noise to increase state coverage) to get the replay buffer. The difference with “final buffer” is that here all 1 million steps come from the same fixed policy, whereas the final buffer’s data is collected throughout 1 million training steps, which may involve many, many gradient updates depending on the gradient-to-env-steps hyper-parameter.

The biggest surprise is that even in the concurrent setting, the new DDPG agent fails to learn well! To be clear: the agents start at the beginning with identical replay buffers, and the offline agent draws minibatches directly from the online agent’s buffer. I can only think of a handful of differences in the training process: (1) the randomness in the initial policy and (2) noise in minibatch sampling. Am I missing anything? Those factors should not be significant enough to lead to divergent performance. In contrast, BCQ is far more effective at learning offline from the given batch of DDPG data.

When reading papers, I often find myself wondering about the relationship between algorithms in batches (pun intended) of related papers. Conveniently, there is a NeurIPS 2019 workshop paper where Fujimoto benchmarks algorithms. Let’s turn to that.

Benchmarking Batch Deep Reinforcement Learning Algorithms

This solid NeurIPS 2019 workshop paper, by the same author as the BCQ paper, makes a compelling case for the need to evaluate Batch RL algorithms under unified settings. Some research, such as his own, shows that commonly-used off-policy deep RL algorithms fail to learn in an offline fashion, whereas (Agarwal et al., 2020) counters this, though with the caveat of using a much larger dataset.

One of the nice things about the paper is that it surveys some of the algorithms researchers have used for Batch RL, including Quantile Regression DQN (QR-DQN), Random Ensemble Mixture (REM), Batch Constrained Deep Q-Learning (BCQ), Bootstrapping Error Accumulation Reduction Q-Learning (BEAR-QL), KL-Control, and Safe Policy Improvement with Baseline Bootstrapping DQN (SPIBB-DQN). All these algorithms are specialized for the Batch RL setting with the exception of QR-DQN, which is a strong off-policy algorithm shown to work well in an offline setting.

Now, what’s the new algorithm that Fujimoto proposes? It’s a discrete version of BCQ. The algorithm is delightfully straightforward:


My “TL;DR”: train a behavior cloning network to predict actions of the behavior policy based on its states. For the Q-function update on iteration $k$, change the maximization over the successor state actions to only consider actions satisfying a threshold:

\[\mathcal{L}(\theta) = \ell_k \left(r + \gamma \cdot \Bigg( \max_{a' \; \mbox{s.t.} \; \frac{G_\omega(a'|s')}{\max_{\hat{a}} G_\omega(\hat{a}|s')} > \tau} Q_{\theta'}(s',a') \Bigg) - Q_\theta(s,a) \right)\]

When executing the policy during test-time rollouts, we can use a similar threshold:

\[\pi(s) = \operatorname*{argmax}_{a \; \mbox{s.t.} \; \frac{G_\omega(a|s)}{\max_{\hat{a}} \; G_\omega(\hat{a}|s)} > \tau} Q_\theta(s,a)\]

Note the contrast where normally in Q-learning, we’d just do the max or argmax over the entire set of valid actions. Therefore, we will end up ignoring some actions that potentially have high Q-values, but that’s fine (and desirable!) if those actions have vastly over-estimated Q-values.
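The filtering rule is simple enough to sketch in a few lines. Here is a toy numpy version of discrete BCQ's action selection (my own illustrative code, not the authors'):

```python
import numpy as np

def discrete_bcq_action(q_values, g_probs, tau=0.3):
    """Pick the argmax-Q action among those whose relative BC probability exceeds tau.

    q_values: Q_theta(s, .) over the discrete action set
    g_probs:  G_omega(.|s), the behavior-cloning network's action probabilities
    tau:      threshold; tau=0 recovers plain Q-learning, tau=1 pure behavior cloning
    (a self-contained sketch of the selection rule, not the paper's code)
    """
    eligible = g_probs / g_probs.max() > tau
    masked_q = np.where(eligible, q_values, -np.inf)  # filtered actions can't win the argmax
    return int(np.argmax(masked_q))

# The BC network strongly prefers action 0; action 2 has a higher (possibly
# over-estimated) Q-value but is filtered out by the threshold.
q = np.array([1.0, 0.9, 5.0])
g = np.array([0.7, 0.25, 0.05])
print(discrete_bcq_action(q, g, tau=0.3))  # prints 0
print(discrete_bcq_action(q, g, tau=0.0))  # prints 2 (plain argmax-Q)
```

This shows the behavior at the extremes of $\tau$ discussed below: at $\tau = 0$ every action with non-zero probability survives the filter, and as $\tau \to 1$ only the BC network's favorite action remains.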

Some additional thoughts:

  • The parallels are obvious between $G_\omega$ in continuous versus discrete BCQ. In the continuous case, it is necessary to develop a generative model which may be complex to train. In the discrete case, it’s much simpler: run behavior cloning!

  • I was confused about why BCQ does the behavior cloning update of $\omega$ inside the for loop, rather than beforehand. Since the data is fixed, this seems suboptimal since the optimization for $\theta$ will rely on an inaccurate model $G_\omega$ during the first few iterations. After contacting Fujimoto, he agreed that it is probably better to move the optimization before the loop, but his results were not significantly better.

  • There is a $\tau$ parameter we can vary. What happens when $\tau = 0$? Then it’s simple: standard Q-learning, because any action should have non-zero probability from the generative model. Now, what about $\tau=1$? In practice, this is exactly behavior cloning, because when the policy selects actions it will only consider the action with highest $G_\omega$ value, regardless of its Q-value. The actual Q-learning portion of BCQ is therefore completely unnecessary since we ignore the Q-network!

  • According to the appendix, they use $\tau = 0.3$.

There are no theoretical results here; the paper is strictly experimental. The experiments are on nine Atari games. The batch of data is generated from a partially trained DQN agent over 10M steps (50M steps is standard). Note the critical design choice of whether:

  • we take a single fixed snapshot (i.e., a stationary policy) and roll it out to get steps, or
  • we take logged data from an agent during its training run (i.e., a non-stationary policy).

Fujimoto implements the first case, arguing that it is more realistic, but I think that claim is highly debatable. Since the policy is fixed, Fujimoto injects noise by setting $\epsilon=0.2$ 80% of the time, and setting $\epsilon=0.001$ otherwise. This must be done on a per-episode basis — it doesn’t make sense to change epsilons within an episode!

What are some conclusions from the paper?

  • Discrete BCQ seems to be the best of the “batch RL” algorithms tested. But the curves look really weird: BCQ performance shoots up to be at or slightly above the noise-free policy, but then stagnates! I should also add: exceeding the underlying noise-free policy is nice, but the caveat is that it’s from a partially trained DQN, which is a low bar.

  • For the “standard” off-policy algorithms of DQN, QR-DQN, and REM, QR-DQN is the winner, but still under-performs a noisy behavior policy, which is unsatisfactory. Regardless, trying QR-DQN in an offline setting, even though it’s not specialized for that case, might be a good idea if the dataset is large enough.

  • The results confirm a finding from (Agarwal et al., 2020) that distributional RL aids exploitation, but suggest that the success Agarwal observed is highly specific to the settings used: a full 50M-step history of a teacher’s replay buffer, with a changing snapshot, plus noise from sticky actions.

Here’s a summary of results in their own words:

Although BCQ has the strongest performance, on most games it only matches the performance of the online DQN, which is the underlying noise-free behavioral policy. These results suggest BCQ achieves something closer to robust imitation, rather than true batch reinforcement learning when there is limited exploratory data.

This brings me to one of my questions (or aspirations, if you put it that way). Is it possible to run offline RL, and reliably exceed the noise-free behavior policy? That would be a dream scenario indeed.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

This NeurIPS 2019 paper is highly related to Fujimoto’s BCQ paper covered earlier, in that it also focuses on an algorithm to constrain the distribution of actions considered when running Q-learning in a pure off-policy fashion. It identifies a concept known as bootstrapping error which is clearly described in the abstract alone:

We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it.

I immediately thought: what’s the difference between bootstrapping error here versus extrapolation error from (Fujimoto et al., 2019)? Both terms can be used to refer to the same problem of propagating inaccurate Q-values during Q-learning. However, extrapolation error is a broader problem that appears in supervised learning contexts, whereas bootstrapping is specific to reinforcement learning algorithms that rely on bootstrapped estimates.

The authors have an excellent BAIR Blog post which I highly recommend because it provides great intuition on how bootstrapping error affects offline Q-learning on static datasets. For example, this figure below shows that in the second plot, we may have actions $a$ that are outside the distribution of actions (OOD is short for out-of-distribution) induced by the behavior policy $\beta(a|s)$, indicated with the dashed line. Unfortunately, if those actions have $Q(s,a)$ values that are much higher, then they are used in the bootstrapping process for Q-learning to form the targets for Q-learning updates.


Incorrectly high Q-values for OOD actions may be used for backups, leading to accumulation of error. Figure and caption credit: Aviral Kumar.

They also have results showing that if one runs a standard off-the-shelf off-policy (not offline) RL algorithm, simply increasing the size of the static dataset does not appear to mitigate performance issues – which suggests the need for further study.

The main contributions of their paper are: (a) theoretical analysis that carefully constraining the actions considered during Q-learning can mitigate error propagation, and (b) a resulting practical algorithm known as “Bootstrapping Error Accumulation Reduction” (BEAR). (I am pretty sure that “BEAR” is meant to be a spin on “BAIR,” which is short for Berkeley Artificial Intelligence Research.)

The BEAR algorithm is visualized below. The intuition is to ensure that the learned policy matches the support of the action distribution from the static data. In contrast, an algorithm such as BCQ focuses on distribution matching (center). This distinction is actually pretty powerful; only requiring a support match is a much weaker assumption, which enables Offline RL to more flexibly consider a wider range of actions so long as the batch of data has used those actions at some point with non-negligible probability.


Illustration of support constraint (BEAR) (right) and distribution-matching constraint (middle). Figure and caption credit: Aviral Kumar.

To enforce this in practice, BEAR uses what’s known as the Maximum Mean Discrepancy (MMD) distance between actions from the unknown behavior policy $\beta$ and the actor $\pi$. This can be estimated directly from samples. Putting everything together, their policy improvement step for actor-critic algorithms is succinctly represented by Equation 1 from the paper:

\[\pi_\phi := \max_{\pi \in \Delta_{|S|}} \mathbb{E}_{s \sim \mathcal{D}} \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ \min_{j=1,\ldots,K} \hat{Q}_j(s, a)\right] \quad \mbox{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}} \Big[ \text{MMD}(\mathcal{D}(\cdot|s), \pi(\cdot|s)) \Big] \leq \varepsilon\]

The notation is described in the paper, but just to clarify: $\mathcal{D}$ represents the static data of transitions collected by behavioral policy $\beta$, and the $j$ subscripts are from the ensemble of Q-functions used to compute a conservative estimate of Q-values. This is the less interesting aspect of the policy update as compared to the MMD constraint; in fact the BAIR Blog post doesn’t include the ensemble in the policy update. As far as I can tell, there is no ablation study that tests just using one or two Q-networks, so I wonder which of the two is more important: the ensemble of networks, or the MMD constraint?
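Since the MMD can be estimated directly from samples, it is easy to sketch. Here is a small numpy version with a Gaussian kernel (my own illustrative code; a Laplacian kernel is another common choice):

```python
import numpy as np

def mmd_squared(x, y, sigma=1.0):
    """Sample-based estimate of squared MMD with a Gaussian kernel.

    x: actions sampled from the dataset's behavior policy at a state
    y: actions sampled from the current actor pi(.|s)
    (an illustrative sketch of the distance BEAR constrains, not the paper's code)
    """
    k = lambda a, b: np.exp(-np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
                            / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
same = rng.normal(size=(64, 2))
shifted = same + 3.0                                   # clearly out-of-support actions
print(mmd_squared(same, same))                         # 0.0: identical samples
print(mmd_squared(same, shifted) > mmd_squared(same, same + 0.1))  # prints True
```

A large MMD means the actor is putting mass on actions far outside the data's support, which is exactly what the constraint $\mathbb{E}_{s \sim \mathcal{D}}[\text{MMD}(\mathcal{D}(\cdot|s), \pi(\cdot|s))] \leq \varepsilon$ penalizes.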

The most closely related algorithm to BEAR is the previously-discussed BCQ (Fujimoto et al., 2019). How do they compare? The BEAR authors (Kumar et al., 2019) claim:

  • Their theory shows convergence properties under weaker assumptions, and they are able to bound the suboptimality of their approach.

  • BCQ is generally better when off-policy data is collected by an expert, but BEAR is better when data is collected by a weaker (or even random) policy. They claim this is because BCQ too aggressively constrains the distribution of actions, and this matches the interpretation of BCQ as matching the distribution of the policy of the data batch, whereas BEAR focuses on only matching the action support.

Upon reading this, I became curious to see if there’s a way to combine the strengths of both of the algorithms. I am also not entirely convinced that MuJoCo is the best way to evaluate these algorithms, so we should hopefully look at what other datasets might appear in the future so that we can perform more extensive comparisons of BEAR and BCQ.

At this point, we now consider papers that are in the second category – those which, rather than constrain actions in some way, focus on investigating what happens with a large and diverse dataset while maximizing the exploitation capacity of standard off-policy Deep RL algorithms.

An Optimistic Perspective on Offline Reinforcement Learning

Unlike the prior papers, which present algorithms to constrain the set of considered actions, this paper argues that it is not necessary to use a specialized Offline RL algorithm. Instead, use a stronger off-policy Deep RL algorithm with better exploitation capabilities. I especially enjoyed reading this paper, since it gave me insights on off-policy reinforcement learning, and the experiments are also clean and easy to understand. Surprisingly, it was rejected from ICLR 2020, and I’m a little concerned about how a paper with this many convincing experimental results can get rejected. The reviewers also asked why we should care about Offline RL, and the authors gave a rather convincing response! (Fortunately, the paper eventually found a home at ICML 2020.)

Here is a quick summary of the paper’s experiments and contributions. When discussing the paper or referring to figures, I am referencing the second version on arXiv, which corresponds to the ICLR 2020 submission and uses “Batch RL” instead of “Offline RL,” so we’ll use both terms interchangeably. The paper was previously titled “Striving for Simplicity in Off-Policy Deep Reinforcement Learning.”

  • To form the batch for Offline RL, they use logged data from 50M steps of standard online DQN training. In general, one step is four environment frames, so this matches the 200M frame case which is standard for Atari benchmarks. I believe the community has settled on the 1 step to 4 frame ratio. As discussed in (Machado et al., 2018), to introduce stochasticity, the agents employ sticky actions. So, given this logged data, let’s run Batch RL, where we run off-policy deep Q-learning algorithms with a 50M-sized replay buffer, and sample items uniformly.

  • They show that the off-policy, distributional DeepRL algorithms Categorical DQN (i.e., C51) and Quantile Regression DQN (i.e., QR-DQN), when trained solely on that logged data (i.e., in an offline setting), actually outperform online DQN!! See Figure 2 in the paper, for example. Be careful about what this claim means: C51 and QR-DQN are already known to be better than vanilla DQN, but the experiments show that even in the absence of exploration, those two methods still out-perform online (i.e., with exploration) DQN.

  • Incidentally, offline C51 and offline QR-DQN also out-perform offline DQN, which as expected, is usually worse than online DQN. (To be fair, Figure 2 suggests that in 10-15 out of 60 games, offline DQN can actually outperform the online variant.) Since the experiments disentangle exploration from exploitation, we can explain the difference between performance of offline DQN versus offline C51 or QR-DQN as due to exploitation capability.

  • So far, then, we have the following algorithms, from worst to best with respect to game score: offline DQN, online DQN, offline C51, and offline QR-DQN. They did not present full results for offline C51 except for a few games in the Appendix, but I’m assuming that QR-DQN would be better in both the offline and online cases. In addition, I also assume that online C51 and online QR-DQN would outperform their offline variants, at least if the offline variants are trained on DQN-generated data.

  • To add further evidence that improving the base off-policy Deep RL algorithm can work well in the Batch RL setting, their results in Figure 4 suggest that using Adam as the optimizer instead of RMSprop for DQN is by itself enough to get performance gains. With that change alone, offline DQN can even outperform online DQN on average! I’m not sure how much I can believe this result, because Adam can’t offer that much of an improvement, right?

  • They also experiment with a continuous control variant, using 1M samples from a logged training run of DDPG. They apply Batch-Constrained Q-learning from (Fujimoto et al., 2019) as discussed above, and find that it performs reasonably well. But they also find that they can simply use Twin-Delayed DDPG (i.e., TD3) from (Fujimoto et al., 2018) (yes, the same guy!) and train normally in an off-policy fashion to get better results than offline DDPG. Since TD3 is known as a stronger off-policy continuous control deep Q-learning algorithm than DDPG, this further bolsters the paper’s claims that all we need is a stronger off-policy algorithm for effective Batch RL.

  • Finally, from the above observations, they propose their Random Ensemble Mixture (REM) algorithm, which uses an ensemble of Q-networks and enforces Bellman consistency among random convex combinations. This is similar to how Dropout works. There are offline and online versions of it. In the offline setting, REM outperforms C51 and QR-DQN despite being simpler. By “simpler” the authors mainly refer to not needing to estimate a full distribution of the value function for a given state, as distributional methods do.
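The random convex combination at the heart of REM is simple enough to sketch in a few lines of numpy (the names and toy shapes here are mine, not from the paper's code):

```python
import numpy as np

def rem_q(q_heads, alphas):
    """Random Ensemble Mixture: a random convex combination of K Q-heads.

    q_heads: array (K, num_actions) of per-head Q-values for one state
    alphas:  convex weights (alphas >= 0, summing to 1), resampled per minibatch
    (a sketch of the mixing step only; the full REM loss enforces Bellman
    consistency through this mixture, loosely analogous to Dropout)
    """
    return alphas @ q_heads

rng = np.random.default_rng(2)
K, num_actions = 4, 3
q_heads = rng.normal(size=(K, num_actions))
alphas = rng.random(K)
alphas /= alphas.sum()              # draw a point from the probability simplex
mixed = rem_q(q_heads, alphas)
print(mixed.shape)                  # prints (3,)
```

Because the weights are convex, the mixed Q-estimate for each action always lies between the minimum and maximum of the individual heads, which gives a cheap ensembling effect without modeling a full return distribution.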

That’s not all they did. In an older version of the paper, they also tried experiments with logged data from a training run of QR-DQN. However, the lead author told me he removed those results since there were too many experiments which were confusing readers. In addition, for logged data from training QR-DQN, it is necessary to train an even stronger off-policy Deep RL algorithm to out-perform the online QR-DQN algorithm. I have to admit, sometimes I was also losing track of all the experiments being run in this paper.

Here is a handy visualization of some algorithms involved in the paper: DQN, QR-DQN, Ensemble-DQN (their baseline) and REM (their algorithm):


My biggest takeaway from reading this paper is that in Offline RL, the quality of the data matters significantly, and it is better to use data from many different policies rather than one fixed policy. That they get logged data from a training run means that, literally, every four steps, there was a gradient update to the policy parameters and thus a change to the policy itself. This induces great diversity in the data for Offline RL. Indeed, (Fujimoto et al., 2019) argues that the success of REM and off-policy algorithms more generally depends on the training data composition. Thus, it is not generally correct to think of these papers contradicting each other; they are more accurately thought of as different ways to achieve the same goal. Perhaps the better way going forward is simply to use larger and larger datasets with strong off-policy algorithms, while also perhaps specializing those off-policy algorithms for the batch setting.

IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

This paper proposes the algorithm IRIS: Implicit Reinforcement without Interaction at Scale. It is specialized for offline learning from large-scale robotics datasets, where the demonstrations may be either suboptimal or highly multi-modal. The algorithm is motivated by the same off-policy, Batch RL considerations as other papers I discuss here, and I found this paper because it cited a bunch of them. Their algorithm is visualized below:


To summarize:

  • IRIS splits control into “high-level” and “low-level” controllers. The high-level mechanism, at a given state $s_t$, must pick a new goal state $s_g$. Then, the low-level mechanism is conditioned on that goal state, and produces the actual actions $a \sim \pi_{im}(\cdot \mid s_t, s_g)$ to take.

  • The high-level policy is split in two parts. The first samples several goal proposals. The second picks the best goal proposal to pass to the low-level controller.

  • The low-level controller, given the goal $s_g$, takes $T$ actions conditioned on that goal. Then, it returns control to the high level policy, which re-samples the goal state.

  • The episode terminates when the agent gets sufficiently close to the true goal state. This is a continuous state domain, so they simply pick a distance threshold to the goal state. The tasks also have sparse rewards, adding another challenge.

How are the components trained?

  • The first part of the high-level controller uses a goal conditional Variational AutoEncoder (cVAE). Given a sequence of states in the data, IRIS samples pairs that are $T$ time steps apart, i.e., $(s_t, s_{t+T})$. The encoder $E(s_{t},s_{t+T})$ maps the tuple to a set of latent variables for a Gaussian, i.e., $\mu, \sigma =E(s_{t},s_{t+T})$. The decoder must construct the future state: $\hat{s}_{t+T} \sim D(s_t, z)$ where $z$ is a Gaussian sampled from $\mu$ and $\sigma$. This is for training; for test time, they sample $z$ from a standard normal $z \sim \mathcal{N}(0,1)$ (with regularization during training) and pass it to the decoder, so that it produces goal states.

  • The second part uses an action cVAE as part of their simpler variant of Batch Constrained Deep Q-learning (discussed at the beginning of this blog post) for the value function in the high-level controller. This cVAE, rather than predicting goals, will predict actions conditioned on a state. This can be trained by sampling state-action pairs $(s_t,a_t)$ and having the cVAE predict $a_t$. They can then use it in their BCQ algorithm because the cVAE will model actions that are more likely to be part of the training data.

  • The low-level controller is a recurrent neural network that, given $s_t$ and $s_g$, produces $a_t$. It is trained with behavior cloning, and therefore does not use Batch RL. But, how does one get the goal? It’s simple: since IRIS assumes the low-level controller runs for a fixed number of steps (i.e., $T$ steps) then they take consecutive state-action sequences of length $T$ and then treat the last state as the goal. Intuitively, the low-level controller trained this way will be able to figure out how to get from a start state to a “goal” state in $T$ steps, where “goal” is in quotes because it is not a true environment goal but one which we artificially set for training. This reminds me of Hindsight Experience Replay, which I have previously dissected.
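The data preparation behind the last point is easy to sketch. Here is a toy numpy version of sampling $(s_t, s_{t+T})$ pairs from a trajectory, treating the later state as the training goal (function and variable names are my own, not from the IRIS code):

```python
import numpy as np

def sample_goal_pair(states, T, rng):
    """Sample a (s_t, s_{t+T}) pair from one trajectory; s_{t+T} acts as the goal.

    states: (L, state_dim) array holding one demonstration trajectory
    T:      fixed horizon of the low-level controller
    (an illustrative data-preparation sketch under my own naming assumptions)
    """
    L = len(states)
    t = rng.integers(0, L - T)          # any start with T steps of lookahead left
    return states[t], states[t + T]     # current state, "goal" T steps later

rng = np.random.default_rng(3)
traj = np.cumsum(rng.normal(size=(50, 4)), axis=0)   # a toy 50-step, 4-D trajectory
s_t, s_goal = sample_goal_pair(traj, T=10, rng=rng)
print(s_t.shape, s_goal.shape)          # prints (4,) (4,)
```

The same pairs can feed both the goal cVAE (predict $s_{t+T}$ from $s_t$) and the behavior-cloned low-level controller (reach $s_{t+T}$ from $s_t$ in $T$ steps).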

Some other considerations:

  • They argue that IRIS is able to handle diverse solutions because the goal cVAE can sample different goals, to explicitly take diversity into account. Meanwhile, the low-level controller only has to model short-horizon goals at a time “resolution” that does not easily permit many solutions.

  • They argue that IRIS can handle off-policy data because their BCQ will limit actions to those likely to be generated by the data, and hence the value function (which is used to select the goal) will be more accurate.

  • They split IRIS into higher and lower level controllers because in theory this may help to handle for suboptimal demonstrations — the high-level controller can pick high value goals, and the low-level controller just has to get from point A to point B. This is also pretty much why people like hierarchies in general.

Their use of Batch RL is interesting. Rather than using it to train a policy, they are only using it to train a value function. Thus, this application can be viewed as similar to papers that are concerned with off-policy RL only for the purpose of evaluating states. Also, why do they argue their variant of BCQ is simpler? I think it is because they forgo training a perturbation model, which BCQ uses to perturb the candidate actions. They also don’t seem to use a twin critic.

They evaluate IRIS on three datasets. Two use their prior work, RoboTurk. You can see an overview on the Stanford AI Blog here. I have not used RoboTurk before so it may be hard for me to interpret their results.

  • Graph Reach: they use a simple 2D navigation example, which is somewhat artificial but allows for easy testing of multi-modal and suboptimal examples. Navigation tasks are also present in other papers that test for suboptimal demonstrations, such as SAVED from Ken Goldberg’s lab.

  • Robosuite Lift: this involves the Robosuite Lift data, where a single human performed teleoperation (in simulation) using RoboTurk, to lift an object. The human intentionally provided suboptimal demonstrations.

  • RoboTurk Can Pick and Place: now they use a pick-and-place task, this time using RoboTurk to get a diverse set of samples from different human operators. Again, I have not used RoboTurk, but it appears that this is the most “natural” of the environments tested.

Their experiments benchmark against BCQ, which is a reasonable baseline.

Overall, I think this paper has a mix of both the “action constraining” algorithms discussed in this blog post, and the “learning from large scale datasets” papers. It was the first to show that offline RL could be used as part of the process for robot manipulation. Another project that did something similar, this time with physical robots, is from DeepMind, to which we now turn.

Scaling Data-driven Robotics with Reward Sketching and Batch Reinforcement Learning

This recent DeepMind paper is the third one I discuss which highlights the benefits of a large, massive offline dataset (which they call “NeverEnding Storage”) coupled with a strong off-policy reinforcement learning algorithm. It shows what is possible when combining ideas from reinforcement learning, human-computer interaction, and database systems. The approach consists of five major steps, as nicely indicated by the figure:


In more detail, they are:

  1. Get demonstrations. This can be from a variety of sources: human teleoperation, scripted policies, or trained policies. At first, the data is from human demonstrations or scripted policies. But, as robots continue to train and perform tasks, their own trajectories are added to the NeverEnding Storage. Incidentally, this paper considers the multi-task setup, so the policies act on a variety of tasks, each of which has its own starting conditions, particular reward, etc.

  2. Reward sketching. A subset of the data points are selected for humans to indicate rewards. Since it involves human intervention, and because reward design is fiendishly difficult, this part must be done with care, and certainly cannot be done by having humans slowly and manually assign a number to every frame. (I nearly vomit when simply thinking about doing that.) The authors cleverly engineered a GUI where a human can literally sketch a reward, hence the name reward sketching, to seamlessly get rewards (between 0 and 1) for each frame.

  3. Learning the reward. The system trains a reward function neural network $r_\psi$ to predict task-specific (dense) rewards from frames (i.e., images). Rather than regress directly on the sketched values, the proposed approach involves taking two frames $x_t$ and $x_q$ within the same episode, and enforcing consistency conditions with the reward functions via hinge losses. Clever! When the reward function updates, this can trigger retroactive re-labeling of rewards per time step in the NES.

  4. Batch RL. A specialized Batch RL algorithm is not necessary because of the massive diversity of the offline dataset, though they do seem to train task-specific policies. They use a version of D4PG, short for “Distributed Distributional Deep Deterministic Policy Gradients” which is … a really good off-policy RL algorithm! Since the NES contains data from many tasks, if they are trying to optimize the learned reward for a task, they will draw 75% of the minibatch from all of the NES, and draw the remaining 25% from task-specific episodes. I instantly made the connection to DeepMind’s “Observe and Look Further (arXiv 2018)” paper (see my blog post here) which implements a 75-25 minibatch ratio among demonstrator and agent samples.

  5. Evaluation. Periodically evaluate the robot and add new demonstrations to NES. Their experiments consist of a Sawyer robot facing a 35 x 35 cm basket of objects, and the tasks generally involve grasping objects or stacking blocks.

  6. Go back to step (1) and repeat, resulting in over 400 hours of video data.
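The 75-25 sampling scheme from step (4) is easy to picture in code. Here is a minimal sketch; the flat dict-of-episodes layout for the NES and all function and variable names are my own illustrative choices, not anything from the paper’s implementation:

```python
import random

def sample_minibatch(nes, task_id, batch_size=256, task_fraction=0.25):
    """Draw a minibatch with (1 - task_fraction) of samples from the full
    NeverEnding Storage and task_fraction from task-specific episodes.

    `nes` is assumed to be a dict mapping task ids to lists of transitions
    (a hypothetical layout for illustration only).
    """
    n_task = int(batch_size * task_fraction)
    n_all = batch_size - n_task
    # Pool every transition across all tasks for the "diverse" portion.
    all_transitions = [t for episodes in nes.values() for t in episodes]
    batch = random.choices(all_transitions, k=n_all)
    # Top up with transitions from the task currently being optimized.
    batch += random.choices(nes[task_id], k=n_task)
    random.shuffle(batch)
    return batch
```

With the default 0.25 fraction, a batch of 256 contains 64 task-specific transitions and 192 drawn from the entire storage, mirroring the paper’s 75-25 split.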

There is human-in-the-loop involved, but they argue (reasonably, I would add) that reward sketching is a relatively simple way of incorporating humans. Furthermore, while human demonstrations are necessary, those are ideally drawn from existing datasets.
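To make the reward learning in step (3) a bit more concrete, here is one plausible form of the pairwise hinge loss on two frames from the same episode. The exact margin value and loss shape are my guesses at the paper’s formulation, not taken from its code:

```python
def reward_consistency_loss(r_t, r_q, s_t, s_q, margin=0.1):
    """Hinge loss tying predicted rewards (r_t, r_q) for two frames of the
    same episode to their sketched values (s_t, s_q): if the human sketched
    frame q as clearly higher-reward than frame t, the network's prediction
    for q should exceed that for t by at least the margin (and vice versa)."""
    if s_q - s_t > margin:            # human says frame q is better
        return max(0.0, r_t - r_q + margin)
    if s_t - s_q > margin:            # human says frame t is better
        return max(0.0, r_q - r_t + margin)
    return 0.0                        # sketched values too close to order
```

The appeal is that the network only has to respect the *ordering* the human sketched, rather than regress onto noisy absolute values.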

They say they will release their dataset so that it can facilitate development of subsequent Batch RL algorithms, though my impression is that we might as well deploy D4PG, so I am not sure if this will spur more Batch RL algorithms. On a related note, if you are like me and have trouble following all of the “D”s in the algorithm and all of DeepMind’s “state of the art” reinforcement learning algorithms, DeepMind has a March 31 blog post summarizing the progression of algorithms on Atari. I wish we had something similar for continuous control, though.

Here are some comparisons between this and the ones from (Agarwal et al., 2020) and (Mandlekar et al., 2020) previously discussed:

  • All papers deal with Batch RL from a large set of robotics-related data, though the datasets themselves differ: Atari versus RoboTurk versus this new dataset, which will hopefully be publicly available. This paper appears to be the only one capable of training Batch RL policies to perform well on new tasks. The analogue for Atari would be training a Batch RL agent on several games, and then applying it (or fine-tuning it) to a new Atari game, but I don’t think this has been done.

  • This paper agrees with the conclusions of (Agarwal et al., 2020) that having a sufficiently large and diverse dataset is critical to the success of Offline RL.

  • This paper uses D4PG as a very powerful, offline RL algorithm for learning policies, whereas (Agarwal et al., 2020) proposes a simpler version of Quantile-Regression DQN for discrete control, and (Mandlekar et al., 2020) only use Batch RL to train a value function instead of a policy.

  • This paper proposes the novel reward sketching idea, whereas (Agarwal et al., 2020) only use environments that give dense rewards, and (Mandlekar et al., 2020) use environments with sparse rewards that indicate task success.

  • This paper does not factorize policies into lower and higher level controllers, unlike (Mandlekar et al., 2020), though I assume in principle it is possible to merge the ideas.

In addition to the above comparisons, I am curious about the relationship between this paper and RoboNet from CoRL 2019. It seems like both projects are motivated by developing large datasets for robotics research, though the latter may be more specialized to visual foresight methods, but take my judgment with a grain of salt.

Overall, I have hope that, with disk space getting cheaper and cheaper, we will eventually have robots deployed in fleets that can draw upon this storage in some way.

Concluding Remarks and References

What are some of the common themes or thoughts I had when reading these and related papers? Here are a few:

  • When reading these papers, take careful note as to whether the data is generated from a non-stationary or a stationary policy. Furthermore, how diverse is the dataset?

  • The “data diversity” and “action constraining” aspects of this literature may be complementary, but I am not sure if anyone has shown how well those two mix.

  • As I mention in my blog posts, it is essential to figure out ways that an imitator can outperform the expert. While this has been demonstrated with algorithms that combine RL and IL with exploration, the Offline RL setting imposes extra constraints. If RL is indeed powerful enough, maybe it is still able to outperform the demonstrator in this setting. Thus, when developing algorithms for Offline RL, merely matching the demonstrator’s behavior is not sufficient.

Happy offline reinforcement learning!



Here is a full listing of the papers covered in this blog post, in order of when I introduced the paper.

Finally, here is another set of Offline RL and related references that I didn’t have time to cover. I will likely modify this post in the future, especially given that I already have summary notes to myself on most of these papers (though they are not yet polished enough to post on this blog).

There is also extensive literature on off-policy evaluation, without necessarily focusing on policy optimization or deploying learned policies in practice. I did not focus on these as much since I wanted to discuss work that trains policies in this post.

I hope this post was helpful! As always, thank you for reading.










Getting Started with Blender for Robotics

Jun 22, 2020

Blender is a popular open-source computer graphics software toolkit. Most of its users probably use it for its animation capabilities, and it’s often compared to commercial animation software such as Autodesk Maya and Autodesk 3ds Max. Over the last one and a half years, I have used Blender’s animation capabilities for my ongoing robotics and artificial intelligence research. With Blender, I can programmatically generate many simulated images which then form the training dataset for deep neural network robotic policies. Since implementing domain randomization is simple in Blender, I can additionally perform Sim-to-Real transfer. In this blog post, and hopefully several more, I aim to demonstrate how to get started with Blender and, more broadly, to make the case for Blender in AI research.
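In practice, domain randomization boils down to sampling scene parameters (colors, lighting, camera pose, and so on) before each render. As a minimal sketch of just the sampling logic, with parameter names and ranges that are purely illustrative rather than taken from my actual experiments:

```python
import random

def sample_scene_params(seed=None):
    """Sample one randomized set of scene parameters for a rendered image.
    Ranges here are illustrative; in Blender each value would be applied to
    materials, lights, and the camera via the bpy API before rendering."""
    rng = random.Random(seed)
    return {
        "cloth_rgb": tuple(rng.uniform(0.0, 1.0) for _ in range(3)),
        "light_energy": rng.uniform(500.0, 2000.0),       # illustrative units
        "camera_jitter": tuple(rng.uniform(-0.05, 0.05) for _ in range(3)),
    }
```

Calling this once per training image yields a dataset whose appearance varies enough that a policy trained on it has a chance of transferring to real camera images.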

As of today’s writing, the latest version is Blender 2.83, which one can download from its website for Windows, Mac, or Linux. I use the Mac version on my laptop for local tests and the Linux version for large-scale experiments on servers. When watching older videos of Blender or borrowing related code, be aware that there was a significant jump between Blender 2.79 and Blender 2.80. By comparison, the gap between versions 2.80 and 2.83 is minor.

Installing Blender is usually straightforward. On Linux systems, I use wget to grab the file online from the list of releases here. Suppose one wants to use version 2.82a, which is the one I use these days. Simply scroll to the appropriate release, right-click the desired file, and copy the link. I then paste it after wget and run the command:

wget https://download.blender.org/release/Blender2.82/blender-2.82a-linux64.tar.xz

This should result in a *.tar.xz file, which for me was 129M. Next, run:

tar xvf blender-2.82a-linux64.tar.xz

The v is optional and is just for verbosity. To check the installation, cd into the resulting Blender directory and type ./blender --version. In practice, I recommend adding the directory to the PATH in the ~/.bashrc like this:

export PATH=${HOME}/blender-2.82a-linux64:$PATH

which assumes I un-tarred it in my home directory. The process for installing on a Mac is similar. This way, typing blender opens the software and produces this viewer:

blender_1

The starting cube shown above is standard in default Blender scenes. There’s a lot to process here, and there’s a temptation to check out all the icons to see all the options available. I recommend resisting this temptation because there’s way too much information. I personally got started with Blender by watching this set of official YouTube video tutorials. (The vast majority have automatic captions that work well enough, but a few strangely have captions in different languages even though the audio is clearly in English.) I believe these are endorsed by the developers, or even provided by them, which attests to the quality of the software’s maintainers and community. The quality of the videos is outstanding: they cover just enough detail, provide all the keystrokes used to help users reproduce the setup, and show common errors.

For my use case, one of the most important parts of Blender is its scripting capability. Blender is tightly intertwined with Python, in the sense that I can create a Python script and run it, and Blender will run through the steps in the script as if I had performed the equivalent manual clicks in the viewer. Over the course of my research, I have often found myself adding things manually in Blender’s viewer and then fetching the corresponding Python commands for scripting later, so let’s see a brief example of how this works in action.

Let’s suppose we want to create a cloth that starts above the cube and falls on it. We can do this manually based on this excellent tutorial on cloth simulation. Inside Blender, I manually created a “plane” object, moved it above the cube, and sub-divided it by 15 to create a grid. Then, I added the cloth modifier. The result looks like this:

blender_2

But how do we reproduce this example in a script? To do that, look at the Scripting tab, and the lower left corner window in it. This will show some of the Python commands (you’ll probably need to zoom in):

blender_3

Unfortunately, there’s not always a perfect correspondence between the commands shown here and the commands that one has to actually put in a script to reproduce the scene. Usually there are commands missing from the Scripting tab that I need to include in my actual scripts in order to get them working properly. Conversely, some of the commands in the Scripting tab are irrelevant. I have yet to figure out a hard and fast rule, and rely on a combination of the Scripting tab, borrowing from older working scripts, and Googling with “Blender Python” in my search queries.

From the above, I then created the following basic script:

# Must be imported to use much of Blender's functionality.
import bpy

# Add collision modifier to the cube (selected by default).
bpy.ops.object.modifier_add(type='COLLISION')

# Add a primitive plane (makes this the selected object). Add the translation
# method into the location to start above the cube.
bpy.ops.mesh.primitive_plane_add(size=2, enter_editmode=False, location=(0, 0, 1.932))

# Rescale the plane. (Could have alternatively adjusted the `size` above.)
# Ignore the other arguments because they are defaults.
bpy.ops.transform.resize(value=(1.884, 1.884, 1.884))

# Enter edit-mode to sub-divide the plane and to add the cloth modifier.
bpy.ops.object.editmode_toggle()
bpy.ops.mesh.subdivide(number_cuts=15)
bpy.ops.object.modifier_add(type='CLOTH')

# Go back to "object mode" by re-toggling edit mode.
bpy.ops.object.editmode_toggle()

If this Python file is called test-cloth.py, then running blender -P test-cloth.py will reproduce the setup (adding the --background flag runs it headless, which is handy on servers). Clicking the “play” button at the bottom results in the following after 28 frames:

blender_4

Nice, isn’t it? The cloth is “blocky” here, but there are modifiers that can make it smoother.

The script does not need to be run inside a “virtualenv” because Blender ships with its own bundled Python. Please see this Blender StackExchange post for further details.

There’s obviously far more to Blender scripting, and I am only able to scratch the surface in this post. To give an idea of its capabilities, I have used Blender for the following three papers:

The first two papers used Blender 2.79, whereas the third used Blender 2.80. The first two used Blender solely for generating (domain-randomized) images from cloth meshes imported from external software, whereas the third created cloth directly in Blender and used the software’s simulator.

In subsequent posts, I hope to focus more on Python scripting and the cloth simulator in Blender. I also want to review Blender’s strengths and weaknesses. For example, there are good reasons why the first two papers above did not use Blender’s built-in cloth simulator.

I hope this served as a concise introduction to Blender. As always, thank you for reading.