# DAgger Versus SafeDAgger

The seminal DAgger paper from AISTATS 2011 has had a tremendous impact on machine learning, imitation learning, and robotics. In contrast to the vanilla supervised learning approach to imitation learning, DAgger proposes to use a supervisor to provide corrective labels to counter compounding errors. Part of this BAIR Blog post has a high-level overview of the issues surrounding compounding errors (or “covariate shift”), and describes DAgger as an on-policy approach to imitation learning. DAgger itself — short for Dataset Aggregation — is super simple and looks like this:

• Train $\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)$ from demonstrator data $\mathcal{D} = \{\mathbf{o}_1, \mathbf{a}_1, \ldots, \mathbf{o}_N, \mathbf{a}_N\}$.
• Run $\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)$ to get an on-policy dataset $\mathcal{D}_\pi = \{\mathbf{o}_1, \ldots, \mathbf{o}_M\}$.
• Ask a demonstrator to label $\mathcal{D}_\pi$ with actions $\mathbf{a}_t$.
• Aggregate $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\pi}$ and train again.

with the notation borrowed from Berkeley’s DeepRL course. The training step is usually done via standard supervised learning. The original DAgger paper includes a hyperparameter $\beta$ so that the on-policy data is actually generated with a mixture:

$\pi = \beta \pi_{\rm supervisor} + (1-\beta) \pi_{\rm agent}$

but in practice I set $\beta=0$, which in this case means all states are generated from the learner agent, and then subsequently labeled from the supervisor.

DAgger is attractive not only in practice but also in terms of theory. The analysis of DAgger relies on mathematical ingredients from regret analysis and online learning, as hinted by the paper title: “A Reduction of Imitation learning and Structured Prediction to No-Regret Online Learning.” You can find some relevant theory in (Kakade and Tewari, NeurIPS 2009).

## The Dark Side of DAgger

Now that I have started getting used to reading and reviewing papers in my field, I can more easily understand tradeoffs in algorithms. So, while DAgger is a conceptually simple and effective method, what are its downsides?

• We have to request the supervisor for labels.
• This has to be done for each state the agent encounters when taking steps in an environment.

Practitioners can mitigate these by using a simulated demonstrator, as I have done in some of my robot fabric manipulation work. In fact, I’m guessing this is the norm in machine learning research papers that use DAgger. This is not always feasible, however, and even with a simulated demonstrator, there are advantages to querying less often.

Keeping within the DAgger framework, an obvious solution would be to only request labels for a subset of data points. That’s precisely what the SafeDAgger algorithm, proposed by Zhang and Cho, and presented at AAAI 2017, intends to accomplish. Thus, let’s understand how SafeDAgger works. In the subsequent discussion, I will (generally) use the notation from the SafeDAgger paper.

## SafeDAgger

The SafeDAgger paper has a nice high-level summary:

In this paper, we propose a query-efficient extension of the DAgger, called SafeDAgger. We first introduce a safety policy that learns to predict the error made by a primary policy without querying a reference policy. This safety policy is incorporated into the DAgger’s iterations in order to select only a small subset of training examples that are collected by a primary policy. This subset selection significantly reduces the number of queries to a reference policy.

Here is the algorithm:

SafeDAgger uses a primary policy $\pi$ and a reference policy $\pi^*$, and introduces a third policy $\pi_{\rm safe}$, known as the safety policy, which takes in the observation of the state $\phi(s)$ and must determine whether the primary policy $\pi$ is likely to deviate from a reference policy $\pi^*$ at $\phi(s)$.

A quick side note: I often treat “states” $s$ and “observations” $\phi(s)$ (or $\mathbf{o}$ in my preferred notation) interchangeably, but keep in mind that these technically refer to different concepts. The “reference” policy is also often referred to as a “supervisor,” “demonstrator,” “expert,” or “teacher.”

A very important fact, which the paper (to its credit) repeatedly accentuates, is that because $\pi_{\rm safe}$ is called at each time step to determine if the reference must be queried, $\pi_{\rm safe}$ cannot query $\pi^*$. Otherwise, there’s no benefit — one might as well dispense with $\pi_{\rm safe}$ all together and query $\pi^*$ normally for all data points.

The deviation $\epsilon$ is defined with the $L_2$ distance:

$\epsilon(\pi, \pi^*, \phi(s)) = \| \pi(\phi(s)) - \pi^*(\phi(s)) \|_2^2$

since actions in this case are in continuous land. The optimal safety policy $\pi_{\rm safe}^*$ is:

$\pi_{\rm safe}^*(\pi, \phi(s)) = \begin{cases} 0, \quad \mbox{if}\; \epsilon(\pi, \pi^*, \phi(s)) > \tau \\ 1, \quad \mbox{otherwise} \end{cases}$

where the cutoff $\tau$ is user-determined.

The real question now is how to train $\pi_{\rm safe}$ from data $D = \{ \phi(s)_1, \ldots, \phi(s)_N \}$. The training uses the binary cross entropy loss, where the label is “are the two policies taking sufficiently different actions”? For a given dataset $D$, the loss is:

\begin{align} l_{\rm safe}(\pi_{\rm safe}, \pi, \pi^*, D) &= - \frac{1}{N} \sum_{n=1}^{N} \pi_{\rm safe}^*(\phi(s)_n) \log \pi_{\rm safe}(\phi(s)_n, \pi) + \\ & (1 - \pi_{\rm safe}^*(\phi(s)_n)) \log(1 - \pi_{\rm safe}(\phi(s)_n, \pi)) \end{align}

again, here, $\pi_{\rm safe}^*$ and $(1-\pi_{\rm safe}^*)$ represent ground-truth labels for the cross entropy loss. It’s a bit tricky; the label isn’t something inherent in a training data, but something SafeDAgger artificially enforces to get desired behavior.

Now let’s discuss the control flow of SafeDAgger. The agent collects data by following a safety strategy. Here’s how it works: at every time step, if $\pi_{\rm safe}(\pi, \phi(s)) = 1$, let the usual agent take actions. Otherwise, $\pi_{\rm safe}(\pi, \phi(s)) = 0$ (remember, this function is binary) and the reference policy takes actions. Since this is done at each time step, the reference policy can return control to the agent as soon as it is back into a “safe” state with low action discrepancy.

Also, when the reference policy takes actions, these are the data points that get labeled to produce a subset of data $D’$ that form the input to $l_{\rm safe}$. Hence, the process of deciding which subset of states should be used to query the reference happens during environment interaction time, and is not a post-processing event.

Training happens in lines 9 and 10 of the algorithm, which updates not only the agent $\pi$, but also the safety policy $\pi_{\rm safe}$.

Actually, it’s somewhat strange why the safety policy should help out. If you notice, the algorithm will continually add new data to existing datasets, so while $D_{\rm safe}$ initially produces a vastly different dataset for $\pi_{\rm safe}$ training, in the limit, $\pi$ and $\pi_{\rm safe}$ will be trained on the same dataset. Line 9, which trains $\pi$, will make it so that for all $\phi(s) \in D$, we have $\pi(\phi(s)) \approx \pi^*(\phi(s))$. Then, line 10 trains $\pi_{\rm safe}$ … but if the training in the previous step worked, then the discrepancies should all be small, and hence it’s unclear why we need a threshold if we know that all observations in the data result in similar actions between $\pi$ and $\pi^*$. In some sense $\pi_{\rm safe}$ is learning a support constraint, but it would not be seeing any negative samples. It is somewhat of a philosophical mystery.

Experiments. The paper uses the driving simulator TORCS with a scripted demonstrator. (I have very limited experience with TORCS from an ICRA 2019 paper.)

• They use 10 tracks, with 7 for training and 3 for testing. The test tracks are only used to evaluate the learned policy (called “primary” in the paper).

• Using a histogram of squared errors in the data, they decide on $\tau = 0.0025$ as the threshold so that 20 percent of initial training samples are considered “unsafe.”

• They report damage per lap as a way to measure policy safety, and argue that policies trained with SafeDAgger converge to a perfect, no-damage policy faster than vanilla DAgger. I’m having a hard time reading the plots, though — their “SafeDAgger-Safe” curve in Figure 2 appears to be perfect from the beginning.

• Experiments also suggest that as the number of DAgger iterations increases, the proportion of time driven by the reference policy decreases.

• First, SafeDAgger is a broadly applicable algorithm. It is not specific to driving, and it should be feasible to apply to other imitation learning problems.

• Second, the cost is the same for each data point. This is certainly not the case in real life scenarios. Consider context switching: one can request the reference for help in time steps 1, 3, 5, 7, and 9, or it can request the reference for help in times 3, 4, 5, 6, and 7. Both require the same raw number of references, but it seems intuitive in some way that given a fixed budget of time, a reference policy should want a contiguous time step.

• Finally, one downside strictly from a scientific perspective is that there are no other baseline methods tested other than vanilla DAgger. I wonder if it would be feasible to compare SafeDAgger with an approach such as SHIV from ICRA 2016.

## Conclusion

To recap: SafeDAgger follows the DAgger framework, and attempts to reduce the number of queries to the reference/supervisor policy. SafeDAgger predicts the discrepancy among the learner and supervisor. Those states with high discrepancy are those which get queried (i.e., labeled) and used in training.

There’s been a significant amount of follow-up work on DAgger. If I am thinking about trying to reduce supervisor burden, then SafeDAgger is among the methods that come to my mind. Similar algorithms may get increasingly used in machine learning if DAgger-style methods become more pervasive in machine learning research, and in real life.

# How I Made My IROS 2020 Conference Presentation Video

This is my official video presentation for IROS 2020.

The 2020 International Conference on Intelligent Robots and Systems (IROS) will be virtual. It was planned to be in Las Vegas, Nevada, from October 25-29. While this was unfortunately expected, I understand the need to reduce large gatherings, as the pandemic is still happening here. I wish our government, and private citizens, could look around the world and see where things are going right regarding COVID-19; for example, Taiwan is having 10,000 person concerts and has all of seven recorded deaths as of today, while the United States still has heavy restrictions on in-person gatherings with well over 200,000 deaths (here’s the source I’ve been checking to track this information).

For IROS 2020, I am presenting a paper on robot fabric manipulation, done in collaboration with wonderful colleagues from Berkeley and Honda Research Institute. IROS 2020 asked us to create a 15-minute video for each paper, and my final product is shown above and also available on my YouTube channel. This is by far the longest pre-recorded video I have ever made for a conference. I believe it’s also my first video with audio. Normally, my research videos are just a handful of minutes long, and if I need to clarify things in the video, I add text (subtitles) manually in the iMovie application. For my IROS video, however, I wanted to make the video longer with audio, but I also knew I needed a more scalable way to add subtitles, which would be necessary for me to completely understand the video if I were to re-watch it many years later. I also wanted to add subtitles and to make them unavoidably visible to encourage other researchers to add subtitles to their videos.

Here is the backstory of how I made this video.

First, as part of my research that turned into this paper, I had many short video clips of a robot manipulating fabric in iMovie on my MacBook Pro laptop. I started a fresh iMovie file, and picked the robot videos that I wanted to include.

Then, I created a new Google Slides and a new Google Doc. In the Google Slides file, I created the slides that I wanted to show in the final video. These slides were mostly copied and pasted from earlier, internal research presentations, and reformatted to a consistent font and size style.

In the Google Doc, I wrote down my entire transcript, which turned out to be slightly over four pages. I then practiced my audio by stating what I wrote on the transcript, peppered with my usual enthusiasm. I also tried to avoid talking too fast. I used the voice Memos app on my iPhone to record audio. I made multiple audio files, each about one minute long. This made it simpler to redo any audio (which I had to do frequently) since I only had to redo small portions instead of the entire video’s audio.

Once I felt like the slides were ready, and that they aligned well with the audio, I put in each slide and audio file into iMovie, carefully adjusting the time ranges to align them, and to make sure the video did not exceed the 15-minute limit. I made further edits and improvements to the video after getting feedback from my colleagues. When I was sufficiently satisfied with the result, I saved and got an .mp4 video file.

iMovie contains functionality for adding subtitles, but the process is manual and highly cumbersome. After some research, I found this video tutorial which demonstrates how to use Kapwing to add subtitles. Kapwing is entirely web-based, so there’s no need to download it locally – I can upload videos to their website and edit in a web browser.

I can add subtitles to Kapwing by uploading audio files, and Kapwing will use automatic speech recognition to generate an initial draft, which I then fine-tune. Here is the interface for adding subtitles:

I paid 20 USD for a monthly subscription so that I could create a longer video, and followed the tutorial mentioned earlier to add subtitles. Eventually, I got my 15-minute video, which just barely fit under the 50MB file limit as mandated by IROS. I uploaded it to the conference, as well as to YouTube, which is the one at the top of this post.

I am happy with the final video product. That said, the process of adding subtitles was not ideal:

• The automatic speech recognition for producing an initial guess at the subtitles is … bad. I mean, really bad. I guess it got less than 5% of my audio correct, so in practice I was adding all of my subtitles by manually copying and pasting from my Google Doc. To put things in perspective, Google Meet (my go-to video conferencing tool these days) handles my audio far better, with subtitles that are remarkably highly quality.

• The interface for subtitles is also cumbersome to use, though to be fair, it’s an improvement over iMovie. As shown in the screenshot above, when re-editing a video, it doesn’t seem to preserve the ordering of the subtitles (notice how my first line in the video is listed second above). Furthermore, when editing and then clicking “Done”, I sometimes saw subtitles with incorrect sizes, so I had to re-edit the video … only to see a few subtitles disappear each time I did this. There also did not seem to be a way to change the subtitle size for all subtitles simultaneously. My solution was to forget about saving in progress, and to painstakingly go through each subtitle to change the size by manually clicking via a drop-down menu.

I hope this was useful! It is likely that future conferences will continue to be virtual in some way. For example, I am attempting to submit several papers to ICRA 2021, which will be in Xi’an, China, next summer. The website says ICRA 2021 will be a hybrid event with a mix of virtual and in-person events, but I would bet that many travel restrictions will still be in place, particularly for researchers from the United States. For that, and several other reasons, I am almost certainly going to be a virtual attendee, so I may need to revisit these instructions when making additional video recordings.

As always, thank you for reading, stay safe, and wear a mask.

# The Virtual 2020 Robotics: Science and Systems Conference

I have attended eight international academic conferences that contain refereed paper proceedings. My situation is much different nowadays as compared to the middle of 2016, when I had attended zero academic conferences, thought my research career was going nowhere, and that I would leave Berkeley without a PhD.

The most recent conference I attended was also my first virtual one, the 2020 Robotics: Science and Systems, one of the world’s premier robotics conferences. It occurred last month, and in keeping up with the tradition of my blog, I will briefly discuss what happened while adding some thoughts about virtual conferences.

## RSS 2020: The Workshop Days

RSS 2020 consisted of five days, with the first two dedicated to workshops. For the two workshop days, I checked the schedule in advance and decided on one workshop per day to attend, to avoid overextending my attention span and to keep my schedule manageable. For the first day, I attended the 2nd Workshop on Closing the Reality Gap in Sim2Real Transfer for Robotics. It was a fun one: much of it consisted of pre-recorded, two-on-two debates addressing controversial statements:

• “Investing into Sim2Real is a waste of time and money”
• “Sim2Real is old news. It’s just X (X=model-based RL, X=domain randomization, X=system identification)”
• “Sim2Real requires highly accurate physical simulators and photorealistic rendering”

I am a huge fan of Sim2Real, and several of my papers use the technique, so I was especially galled by the first claim. Surely it can’t possibly be a waste of time and money? (Full disclosure: one of my PhD advisors agrees with me and was in the debate arguing against the first claim, but I still would hold that belief even if he was not involved. You’ll have to take my word on that.) Going through the debate was enjoyable – despite my opposition to the statement, I appreciated the perspectives of the two CMU professors, Abhinav Gupta and Chris Atkeson, arguing in favor of the claim. While researching the academic publications of those professors, I found Chris Atkeson’s impressive and persuasive 100-page paper providing his advice for starting graduate students, which features some Sim2Real discussion.

Rather than try to further describe my messy notes, I will refer you to my former colleague (and now CMU PhD student) Jacky Liang, who wrote a nice summary of the workshop. I am still going to be doing some Sim2Real work in the near future, and particularly nowadays due to the pandemic limiting access to physical robots, an obvious point that was somehow only articulated at the end of the workshop, by Berkeley Professor Anca Dragan.

For the next day, I attended the workshop on Self-Supervised Robot Learning. This was a four-hour workshop, and one that was more traditional in the sense that it was a series of longer talks by professors and research scientists, with shorter “lightning talks” by authors of accepted workshop papers. I chose to attend this because I am very interested in the topic, and think getting automatic supervision without tedious, manual labeling is key for scaling up robots to the real world. Here’s a relevant blog post of mine if you would like to read more.

There were seven speakers, and I have personally spoken to six of them (all except Abhinav Gupta):

• Dieter Fox: discussed KinectFusion and self-correspondences with descriptors. I knew some of this material from reading the relevant papers, and (you guessed it) I have a relevant blog post.
• Abhinav Gupta: talked about much of newly-appointed Professor Lerrel Pinto’s work in scaling up robot learning. I have read almost all of Lerrel Pinto’s early papers and was pleased to see them resurface here.
• Pierre Sermanet: discussed his “learning from play” papers which involve planning and learning from language. It’s fascinating stuff, and I have his papers on my “to read” list.
• Roberto Calandra: provided a series of “lessons learned” in doing robot learning research, and commented about how COVID-19 might mandate more self-supervised robots that can run on their own.
• Chelsea Finn: presented a chronology over the last 5 years about how we acquire data for robot learning, and how we can make this scale up. Critically, we need to broaden the training data distribution to cover more test-time scenarios.
• Pieter Abbeel: presented the CURL and RAD papers which suggest that learning from pixels can be as efficient as learning from state. I have read the papers in some detail, and helped with formatting the recent BAIR blog post about CURL and RAD.
• Andy Zeng: provided his thoughts on the “object-ness” assumption in robot learning, and how he was able to get automatic labels for his papers. I described some of his great work in this blog post. I am also very fortunate to have him as my Google summer internship host!

It was a great workshop with great speakers.

## RSS 2020: The Conference Days

The next three days were the formal conference days. In general, the schedule was similar for each day. Each had live talks and two hours of live poster sessions of accepted papers, all happening over Zoom. RSS is still a relatively small robotics conference, with only 103 accepted papers in 2020, in contrast to ICRA and IROS which now have well over 1000 accepted papers each year. This meant that RSS was “single track,” so only one thing was formally happening at once.

After the opening talks to introduce us to virtual RSS, we had the first of the two-hour paper discussion sections. I stayed primarily in the Zoom room allocated to my paper at RSS 2020.

University of Washington Professor Byron Boots gave an “early career” talk in the afternoon, featuring online learning and regret analysis, which befits his publication list. Some of his work involves analyzing Model Predictive Control (MPC), and once again I felt relieved about my RSS paper, which used MPC. Working on that project has made it so much easier for me to understand MPC and related topics.

The second day began with a Diversity and Inclusion panel, featuring people such as Michigan Professor Chad Jenkins. I watched the discussion and thought it went well. We then had the usual two hours of paper discussions. Most Zoom rooms were almost empty, with the exception of the paper authors. Honestly, I like this, because it made it easy for me to talk with various paper authors.

The keynote talk by MIT Professor Josh Tenenbaum later that day was excellent. He’s done great work in areas that overlap with robotics, most notably computer vision and psychology, and I was thinking about how I could incorporate his findings into my research agenda.

The third day of the conference began with a discussion and a town hall. Many conferences have started these discussions, which I suspect is in large part to solicit feedback on how to make conferences more inclusive to the research community. I recall that a conference organizer mentioned that we have professional real-time captioning for all the major talks, and praised it. I agree! There was some Q & A at the end, and one thought-provoking comment came from an audience member who thought that hybrid conferences that combine virtual and in-person events would not work well. While the commenter made it clear that he/she wanted to see a hybrid event work, there is a huge risk in creating an inequity to favor people who are there in-person over those who are attending virtually. It will be interesting to see what happens with ICRA 2021, which is planned to be in Xi’an, China, next May. The ICRA 2021 website is already saying that the conference will be hybrid.

After this, we had the usual paper discussions, followed by Stanford Professor Jeannette Bohg’s excellent early career talk. The day concluded with the paper awards and the farewell talk. First, congratulations to Google for winning several awards! Second, the conference organizers said they could not provide any definitive information about where RSS would be held next year.

I read through various blog posts before attending RSS, such as one from Berkeley Professor Ben Recht and one from the organizers of ICLR 2020, which was one of the first conferences in 2020 that was forced to go virtual, so I had a rough sense of what to expect from a virtual conference. As usual, though, there’s no substitute for going through the process in person (I’m not sure if that should be a pun or not). Here are some brief thoughts:

• The conference had some virtual rooms, which I think are called “gather” sessions, for informal chats. Unfortunately, almost every time I logged into these rooms, I was the only one there. Did people make heavy use of these? On a related note, there were a few slack channels for the workshops, but I think hardly anyone used them. Maybe Slack channels should be deprioritized for smaller virtual conferences?
• Since it looks like many conferences will be virtual or hybrid going forward, perhaps we should get rid of the requirement that at least one author of each accepted paper has to physically attend the conference. Given the COVID-19 situation, and also geopolitical issues pertaining to visas and immigration, it seems like people ought to have the option to avoid travel.
• Getting a smaller conference to be time-zone friendly is a huge challenge. With a larger one like ICLR, it’s possible to have the conference run 24/7 with something happening for each time zone, but I don’t know of a good solution for one the size of RSS.
• I didn’t have the ability to set aside my entire week for the conference, since I was still interning at Google while this was happening, though I suppose I could have asked for a few days off. This meant I worked more on research than I usually do during conferences. I’m not sure if that’s a good thing or a bad thing.
• While I didn’t ask questions during the talks, I think a virtual setting makes it easier for many of us to ask questions. In a physical conference, we might have to walk to a microphone in an auditorium of thousands of people.
• As mentioned earlier, I like the smaller Zoom sessions that replace physical poster sessions. It was far easier for me to engage in substantive conversations with other researchers. In contrast, when I was at NeurIPS 2019, I could barely talk to any author given the size and the crowded, elbow-to-elbow poster sessions.
• I thought it was easier to get an academic accommodation; I requested professional captioning. For a virtual conference, it isn’t necessary to pay for someone (e.g., a sign language interpreter) to physically travel, which can increase costs.

To conclude this blog post, I want to thank the RSS organizers. I know things aren’t quite ideal, but virtual RSS went well, and I hope to attend in 2021, whether in person or virtual.

# On Anti-Racism

The last few months have taught us a lot about America. Our country is facing the twin crises of COVID-19 and racism. While the former is novel and the current crisis is in part (actually, largely) due to a lack of leadership by our top political officials, the latter is perhaps the oldest problem that stubbornly never disappears. In this post, I discuss, in order: policing (including one benign encounter with a police officer), anti-racism in academia and AI, and what I will try to do for my anti-racist education. I will discuss what I am reading, where I am donating to, and what I can commit to doing in the near future.

Policing. I was as appalled as many others from watching videos of police treatment of African-Americans in this country, especially with the George Floyd case, and I share the concerns many have over police conduct against Blacks. On the other hand, I also believe there has to be a police presence — or law enforcement more broadly — of some sort. The 1969 Murray-Hill riots in Canada, for example, where a strike by Montreal police lead to widespread lawless activity, demonstrates just how badly society depends on law enforcement, and makes me worry that the absence of police presence can lead to anarchy.

Growing up, I was told to be extra cautious around the police, and to make my hearing disability clear and upfront to any police officers to avoid misunderstandings. There have been tragic cases of deaf people being harmed and even killed by law enforcement officers who presumed a deaf person could hear and was engaging in indifference or disobedience of law enforcement commands. I know my situation is not the same as and is far milder than what many Blacks experience. While I am not white, people often think I am white by my physical appearance, so my racial composition has not been problematic.

In my life, I have been stopped by the police a grand total of zero times. Well, except for (arguably) one case where Berkeley was running a random “sobriety test” on a Friday night, and police officers were stopping every car on the street that led to my apartment. That night I wasn’t driving home from a party; I was working in the robotics lab until 9:00pm.

When it was my turn, my conversation with the police officer went like this:

Me: Hello. Nice to meet you. Just to let you know I’m deaf and may not fully understand everything you say. But I’m happy to answer any questions you have. I am curious about what is happening here.

Police officer [smiling]: Gotcha. This is a random test that we’re having to check all drivers here. In any case I don’t smell any alcohol on you, so you’re free to go.

That’s it! I have otherwise never spoken to a police officer in any driving-related context, and my few other interactions with police officers have similarly been under extraordinarily uneventful and non-threatening situations. When people such as United States Senator Tim Scott of South Carolina get stopped by the police at the Senate as he describes in this interview, then I wonder how our society can fix this.

That said, I also want to take a data-driven approach to let sober facts dictate my beliefs, rather than emotions or one-time events. Dramatic videos only show a small fraction of all police activity. Given the authority, trust, and power we give to police officers, however, the bar for their code-of-conduct should be high.

To summarize, I don’t think we should get rid of the police. I do believe we need to continue and improve training of police officers, the majority of whom do not have a college education, and to provide support (and better pay) to the good police officers while firing the bad ones. It may also be helpful if we can collectively reduce the need for police officers to deal with non-critical cases such as parking tickets and jaywalking so that they can prioritize the truly dangerous criminals. I can’t claim to be an expert on policing, so I will continue learning as much as I can about this area.

Anti-Racism in Academia and Artificial Intelligence. The Berkeley EECS department, like many similar ones in the country, is heavily dominated by Whites and Asians, and has very few Blacks, so discussion of race and racism (at least from my conversations) tend to involve the White/Asian dynamic with limited commentary about other groups.

The good news is that there’s been recent discussion about how to be anti-racist, with increased focus on Blacks. There was an email sent out by the chairs of the department which linked to statements by much of the faculty affirming their support for anti-racism. Several department-wide reading groups, email lists, and committees now exist for supporting anti-racism. A PhD student in the department, Devin Guillory, has a manuscript on combating anti-Blackness with a specific focus on the Artificial Intelligence community.

I think it’s important for the AI community to discuss the broader impacts of how our technologies can be used both for good and for bad, particularly when they can exacerbate existing disparities. One recent technology that is worth discussing is facial recognition. While I don’t do research in this area, my robotics research often uses technologies based on Deep Convolutional Neural Networks that form the bedrock for facial recognition.

Rarely is it easy to admit that one is wrong, but I think I was wrong about my initial stance on facial recognition. When I first learned about the capabilities of Convolutional Neural Networks from CS 280 at Berkeley and then the associated facial recognition literature, I dreamed of society deploying the technology to detect and catch criminals with surgical precision. (I don’t have an earlier blog post or other writing about this, so you’ll have to take my word on it.)

Since then, I’ve done almost a complete reversal and now think we should limit facial recognition research and technology, at least until we can come up with solutions that explicitly consider minority interests. Here’s why:

• I share concerns over potential inaccuracies in the technology when it pertains to racial minorities. For example, a landmark 2018 paper by Joy Buolamwini and Timnit Gebru showed that facial recognition technologies (at least at the time of publication) were far more inaccurate on people with darker skin. While the technology may have gotten more accurate on people with darker skin since it was published, a recent news article about a wrongful arrest of a black man due to facial recognition makes me anxious.

• I also worry about facial recognition being used to limit and control personal freedom. I see the extreme case of facial recognition technologies in China, where particularly in Xinjiang, they have an extensive surveillance system over the Uighur Muslims. While it’s challenging to make imperfect comparisons across different countries and governance systems, I hope that the United States does not reach this level of surveillance, and the situation there should serve as a warning sign for American residents to be wary of facial recognition systems in our own communities.

When the ACM made the following tweet a few months ago, I was heartened to see pushback by many members of the computer science community. I hope this causes the community to carefully consider the development of facial recognition technologies.

Left: a tweet the ACM sent out regarding facial recognition. (I believe this is the tweet; it's hard to find because they have deleted it.) Right: the ACM's apology.

Anti-Racism More Broadly. As mentioned earlier, as part of my broader anti-racism education, I am pursuing three separate activities which can be categorized as reading books, donating to organizations, and making commitments about my actions now and in the future.

First, in terms of books, I have been reading these in recent months:

• Evicted: Poverty and Profit in the American City by Matthew Desmond (published 2016)
• White Fragility: Why It’s So Hard for White People to Talk About Racism by Robin DiAngelo (published 2018)
• So You Want to Talk About Race? by Ijeoma Oluo (published 2018)
• Me and White Supremacy: Combat Racism, Change the World, and Become a Good Ancestor by Layla F. Saad (published 2020)
• Stamped from the Beginning: The Definitive History of Racist Ideas in America by Ibram X. Kendi (published 2016)

I finished the first four books above, and recommend all of them. I am currently working through Ibram X. Kendi’s book. I enjoy reading the books — not, of course, in the sense that racism is “enjoyable” but because I think these are well-written, well-argued books that teach me.

In addition, I also commit to increasing the number of books I read about Blacks or by Black authors. Given that I post my reading list online (see the blog archives), it should be easy to keep me accountable.

Second, I have learned more about, and have donated to, these organizations:

All relate to tech: the first for young Black women, the second for Black researchers in AI, the third for Black and Latinx in tech, the fourth for under-represented minorities more broadly, and the fifth for low-income youth. There are other loosely related organizations that I support and have donated to in the past, but I think the above are the most relevant for the current blog post context.

Third, Going forward, I will commit to anti-racism. I will not shy away from discussing this topic. I will actively help with recruitment and retainment of Blacks within my work environment. I also will avoid comments that show insensitivity in race-related contexts, including but not limited to: “playing the race card,” “I don’t see color,” “All Lives Matter,” “I am not White,” or “I have Black friends.” I also will not claim that my research is entirely disjoint from race. My robotics research is less directly race-related as compared to facial recognition research, but that is different from saying that it has nothing to do with race.

I will be careful to consider a variety of perspectives when forming my own opinions for related events. It may be the case that I believe something which most of my nearby colleagues disagree with. We don’t have to agree on everything, but I would like the academic community to avoid cases similar to how US Senator Dick Durbin smeared fellow US Senator Tim Scott (since apologized), and more generally to avoid treating Blacks as a monolithic group.

Concluding Remarks. While this blog post is coming to a close, the process of being an anti-racist will be a lifelong process. I am never going to claim perfection or that I have passed some “anti-racist threshold” and am therefore one of the “good guys.” This is a lifelong process. I will make many mistakes along the way. I may discuss more about this in some future blog posts. In the meantime, let me know if you have comments or suggestions.

# Regarding the ICE International Student Ban

I was going to email this letter to elected federal politicians, but fortunately, the U.S. Immigration and Customs Enforcement (ICE) seems to have repealed their misguided policy about forcing international students to take classes in-person. Nonetheless, here’s the letter, and in case a similar policy somehow re-emerges, I will start sending this message. This particular letter is addressed to U.S. Senator Dianne Feinstein given my California residency, her particular Senate Committee assignments, and because of the six offices I called last week, only hers had an actual human on the line for me to address my concerns. The letter is based on this template. Unfortunately I’m not sure who wrote it.

Dear Senator Dianne Feinstein,

My name is Daniel Seita. I currently reside in the San Francisco Bay Area. I am registered voter and thank you for your many years of service in the United States Senate representing California.

I am emailing to insist that you stop the recent student ban.

On July 6, 2020, the United States Immigration and Customs Enforcement announced that they will be modifying their Student and Exchange Visitor Program (SEVP) impacting F-1 and M-1 international students. Under the modified SEVP, F-1 and M-1 students with valid student visas would be forced to leave the United States if their college or university was not offering in-person classes.

International students pay the highest tuition to colleges and universities and shifting to an online-only syllabus does not reduce and shrink the economic burden of the high costs. Forcing international students to pay these high costs while also making them leave the country is unfair on many levels. Furthermore, the funds that international students bring in subsidize domestic students.

With the COVID-19 pandemic still spiking, the opening of in-person classes is unsafe and unnecessary. These new SEVP modifications force universities to choose between opening in-person classes even if it is not safe or lose their international student body who account for billions of dollars to the US economy.

International students have built lives for themselves while at school, and it is cruel to take it away. Students have signed leases and agreements, have possessions and belongings, and have loved ones and friends that they are being ripped apart from because of the unpredictable consequences of COVID-19. Many domestic students are unable to take classes in-person and it is an unfair expectation that international students who are here, legally, for school must be able to enroll in on-campus courses in order to stay in the country. With the fall semester rapidly approaching there is little time for students to transfer schools or find somewhere else to live.

The US has many of the best universities in the world, and a large part of that is due to immigration and international students. Our country has an unparalleled ability to recruit the best and brightest from all over the world, many of whom choose to stay in the country after their education. Without the contributions of international students and faculty, the quality of our education, research, and innovation would plummet.

I am a computer science PhD student at the University of California, Berkeley, and I work in artificial intelligence and robotics. I would guess that one-third of the people who I regularly collaborate with in my research are internationals. They have taught me so much about my field and have helped to raise my quality of research. Severing these collaborations will not only disrupt our research, but damage America’s global reputation.

I hope you consider these concerns and convince ICE to overturn the student ban.

Daniel Seita

Thanks to every international student and collaborator who teaches and inspires me.

# When Deep Models for Visual Foresight Don't Need to be Deep

The virtual Robotics: Science and Systems (RSS) conference will happen in about a week, and I will be presenting a paper there. This is going to be my first time at RSS, and I was hoping to go to Oregon State University and meet other researchers in person, but alas, given the rapid disintegration of America as it pertains to COVID-19, a virtual meeting makes 100 percent sense. For RSS 2020, I’ll be presenting our paper VisuoSpatial Foresight for Multi-Step, Multi-Task Fabric Manipulation, co-authored with Master’s (and soon to be PhD!) student Ryan Hoque. This is based on a technique called visual foresight, and in this blog post, I’d like to briefly touch upon the technique, and then discuss a little more about our RSS 2020 paper, along with another surprising paper which shows that perhaps we need to rethink our deep models.

First, to make sure we’re on common ground here, what do people mean when we say the words “Visual Foresight”? This refers to the technique described in an ICRA 2017 paper by Chelsea Finn and Sergey Levine, which was later expanded upon in a longer journal paper with lead authors Chelsea Finn and Frederik Ebert. The authors are (or were) at UC Berkeley, my home institution, which is one reason why I learned about the technique.

Visual Foresight is typically used in a model-based RL framework. I personally categorize model-based methods into whether the models predict images or whether they predict some latent variables (assuming, of course, that the model itself needs to be learned). Visual Foresight applies to the former case for predicting images. In practice, given the difficult nature of image prediction, this is often done by predicting translations or deltas between images. For the second case of latent variable prediction, I refer you to the impressive PlaNet research from Google.

For another perspective on model-based methods, the following text is included in OpenAI’s “Spinning Up” guide for deep reinforcement learning:

Algorithms which use a model are called model-based methods, and those that don’t are called model-free. While model-free methods forego the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune. As of the time of writing this introduction (September 2018), model-free methods are more popular and have been more extensively developed and tested than model-based methods.

and later:

Unlike model-free RL, there aren’t a small number of easy-to-define clusters of methods for model-based RL: there are many orthogonal ways of using models. We’ll give a few examples, but the list is far from exhaustive. In each case, the model may either be given or learned.

I am writing this in July 2020, and I believe that since September 2018, model-based methods have made enormous strides, to the point where I’m thinking that 2018-2020 might be known as the “model-based reinforcement learning” era. Also, to comment on a point from OpenAI’s text, while model-free methods might be easier to implement in theory, I argue that model-based methods can be far easier to debug, because we can check the predictions of the learned model. In fact, that’s one of the reasons why we took the model-based RL route in our RSS paper.

Anyway, in our RSS paper, we focused on the problem of deformable fabric manipulation. In particular, given a goal image of a fabric in any configuration, can we train a pick-and-place action policy that will manipulate the fabric from an arbitrary starting configuration to the goal configuration? For Visual Foresight, we trained a deep recurrent neural network model that could predict full 56x56 resolution images of fabric. We predicted depth images in addition to color images, making the model “VisuoSpatial.” Specifically, we used Stochastic Variational Video Prediction (SV2P) as our model. The wording “Stochastic Variational” means the model samples a latent variable before generating images, and the stochastic nature of that variable means the model is not deterministic. This is an important design aspect; see the SV2P paper for further details. But, as you might imagine, this is a very deep, recurrent, and complex model. Is all this complexity needed?

Perhaps not! In a paper at the Workshop on Algorithmic Foundations of Robotics (WAFR) this year, Terry Suh and Russ Tedrake of MIT show that, in fact, linear models can be effective in Visual Foresight.

Wait, really?

Let’s dive into that work in more detail, and see how it contrasts to our paper. I believe there are great insights to be gained from reading the WAFR paper.

In this paper, Terry Suh and Russ Tedrake focus on the task of pushing small objects into a target zone, such as pushing diced onions or carrots not unlike how a human chef might need do so. Their goal is to train a pushing policy that can learn and act based on greyscale images. They make a similar argument that we do in our RSS 2020 paper about the difficulty of knowing the “underlying physical state.” For us, “state” means vertices of cloth. For them, “state” means knowing all poses of objects. Since that’s hard with all these small objects piled upon each other, learning from images is likely easier.

The actions are 4D vectors $\mathbf{u}$ which have (a) the 2D starting coordinates, (b) the scalar push orientation, and (c) the scalar push length. They use Pymunk for simulation, which I’ve never heard of before. That’s odd, why not use PyBullet, which might be more standardized for robotics? I have explicitly been able to simulate this kind of environment in PyBullet.

That having been said, let’s consider first how (a) they determine actions, and (b) their visual foresight video prediction model.

Section 2.2 describes how they pick actions (for all methods they benchmark). Unlike us, they do not use the Cross Entropy Method (CEM) — there is no action sampling plus distribution refitting as happens in the CEM. The reason is that they can define a Lyapunov function which accurately characterizes performance on their task, and furthermore, they can minimize for it to get a desired action. The Lyapunov function $V$ is defined as:

$V(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{p_i \in \mathcal{X}} \min_{p_j \in \mathcal{S}_d} \|p_i - p_j\|_{p}$

where $\mathcal{X} = \{p_i\}$ is the set of all 2D particle positions, and $\mathcal{S}_d$ is the desired target set for the particles. The notation $\| \cdot \|_p$ simply refers to a distance metric in the $p$-norm.

The figure above is from the paper, who visualizes the Lyapunov function. It is interpreted as a distance between a discrete set of points and a continuous target set. There’s a pentagon at the center indicating the target set. In their instantiation of the Lyapunov function, if all non-zero pixels (nonzero means carrots, due to height thresholding) in the image of the scene coincide with the pentagon, then the element-wise product of the two images is 0 everywhere, and summing it all will result in 0.

The paper makes the assumption that:

for every image that is not in the target set, we can always find a small particle to push towards the target set and decrease the value of the Lyapunov function.

I agree. While there are cases when pushing particles inwards might result in higher values (i.e., worse performance) due to pushing particles inside the zone to be outside of it, I think it is always possible to find some movement that gets a greater number of particles in the target. If anyone has a counter-example, feel free to share. This assumption may be more true for convex target sets, but I don’t think the authors make that assumption since they test on targets shaped “M”, “I”, and “T” later.

Overall, the controller appears to be accurate enough so that the prediction model performance is the main bottleneck. So which is better: deep or switched-linear? Let’s now turn to that, along with the “visual foresight” aspect of the paper.

Their linear model is “switched-linear”. This is an image-to-image mapping based on a linear map characterized by

$y_{k+1} = \mathbf{A}_i y_k$

for $i \in \{1, 2, \ldots, |\mathcal{U}|\}$, where $\mathcal{U}$ is the discretized action space and $y_k \in \mathbb{R}^{N^2}$ represents the flattened $N \times N$ image at time $k$. Furthermore, $\mathbf{A}_i \in \mathbb{R}^{N^2 \times N^2}$. This is a huge matrix, and there are as many of these matrices as there are actions! This appears to require a lot of storage.

My first question after reading this was: when they train the model using pairs of current and successor images $(y_{k}, y_{k+1})$, is it possible to train all the $\mathbf{A}_i$ matrices?

Or are we restricted to only the matrix corresponding to the action that was chosen to transform $y_k$ into $y_{k+1}$? If this were true, that is a serious limitation. I breathed a sigh of relief when the authors clarified that they can reuse training samples, up to the push length. They discretized the push length by 5, and then got 1000 data points (image pairs) for each of those, for 5000 total. Then they find the optimum matrices (and actions, since matrices are actions here) by the ordinary least squares

Their deep models are referred to as DVF-Affine and DVF-Original. The affine one is designed for fairer comparison with the linear model, so it’s an image-to-image prediction model, with five separate neural networks for each of the discretized push lengths. DVF-Original takes the action as an additional input, while DVF-Affine does not.

Surprisingly, their results show that their linear model has lower prediction error on a held-out set of 1000 test images. This should directly translate to better performance on the actual task, since more accurate models mean the Lyapunov function will be driven down to 0 faster. Indeed, their results confirm the prediction error results, in the sense that linear models are the best or among the best in terms of task performance.

Now we get to the big question: why are linear models better than deeper ones for these experiments? I thought of these while reading the paper:

• The carrots are very tiny in the images, so perhaps the 32x32 resolution makes it hard to accurately capture the fine-grained nature of the carrots.

• The images are grayscale and small, which means linear models may work better as opposed to if the images were larger. At some point the “$N$” in their paper will grow too large to be used with linear models. (Of course with larger images, the problem of video prediction becomes exponentially harder. Heck, we only used 56x56 in our paper, and the SV2P paper used 64x64 images.)

• Perhaps there’s just not enough data? It looks like the experiments use 23,000 data points to train DVF-Original, and 5,000 data points for DVF-Affine? For a point of comparison, we used about 105,000 images of cloth.

• Furthermore, the neural networks are trained directly on the pixels in an end-to-end manner using the Frobenius norm loss (basically mean square error on pixels). In contrast, models such as SV2P are trained using Variational AutoEncoder style losses, which may be more powerful. In addition, the SV2P paper explicitly stated that they performed a multi-stage training procedure since a single end-to-end procedure tends to converge to less than ideal solutions.

• Perhaps the problem has a linear nature to it? While reading the paper, I was reminded of the thought-provoking NeurIPS 2018 paper on how simple random search on linear models is competitive for reinforcement learning on MuJoCo environments.

• Judging from Figure 11, the performance of the better neural network model seems almost as good as the linear one. Maybe the task is too easy?

Eventually, the authors discuss their explanation: they believe that their problem has natural linearity in it. In other words, there is inductive bias in the problem. Inductive bias in machine learning is a fancy way of saying that different machine learning models make different assumptions about the prediction problem.

Overall, the WAFR 2020 paper is effective and thought-provoking. It makes me wonder if we should have at least tried a linear model that could perhaps predict edges or corners of cloth while trying to abstract away other details. I doubt it would work for complex fabric manipulation tasks, but perhaps for simpler ones. Hopefully someone will explore this in the future!

Here are the papers discussed in this post, ordered by publication date. I focused mostly on the WAFR 2020 paper, and the others are: my paper with Ryan for RSS, the two main Visual Foresight papers, and the S2VP paper that uses the video prediction model we’ve used for our paper.

# Offline (Batch) Reinforcement Learning: A Review of Literature and Applications

Reinforcement learning is a promising technique for learning how to perform tasks through trial and error, with an appropriate balance of exploration and exploitation. Offline Reinforcement Learning, also known as Batch Reinforcement Learning, is a variant of reinforcement learning that requires the agent to learn from a fixed batch of data without exploration. In other words, how does one maximally exploit a static dataset? The research community has grown interested in this in part because larger datasets are available that might be used to train policies for physical robots. Exploration with a physical robot may risk damage to robot hardware or surrounding objects. In addition, since offline reinforcement learning disentangles exploration from exploitation, it can help provide standardized comparisons of the exploitation capability of reinforcement learning algorithms.

Offline reinforcement learning, henceforth Offline RL, is closely related to imitation learning (IL) in that the latter also learns from a fixed dataset without exploration. However, there are several key differences.

• Offline RL algorithms (so far) have been built on top of standard off-policy Deep Reinforcement Learning (Deep RL) algorithms, which tend to optimize some form of a Bellman equation or TD difference error.

• Most IL problems assume an optimal, or at least a high-performing, demonstrator which provides data, whereas Offline RL may have to handle highly suboptimal data.

• Most IL problems do not have a reward function. Offline RL considers rewards, which furthermore can be processed after-the-fact and modified.

• Some IL problems require the data to be labeled as expert versus non-expert. Offline RL does not make this assumption.

I preface the IL descriptions with “some” and “most” because there are exceptions to every case and that the line between methods is not firm, as I emphasized in a blog post about combining IL and RL.

Offline RL is therefore about deriving the best policy possible given the data. This gives us the hope of out-performing the demonstration data, which is still often a difficult problem for imitation learning. To be clear, in tabular settings with infinite state visitation, it can be shown that algorithms such as Q-learning converge to an optimal policy despite potentially sub-optimal off-policy data. However, as some of the following papers show, even “off-policy” Deep RL algorithms such as the Deep Q-Network (DQN) algorithm require substantial amounts of “on-policy” data from the current behavioral policy in order to learn effectively, or else they risk performance collapse.

For a further introduction to Offline RL, I refer you to (Lange et al, 2012). It provides an overview of the problem, and presents Fitted Q Iteration (Ernst et al., 2005) as the “Q-Learning of Offline RL” along with a taxonomy of several other algorithms. While useful, (Lange et al., 2012) is mostly a pre-deep reinforcement learning reference which only discusses up to Neural Fitted Q-Iteration and their proposed variant, Deep Fitted Q-Iteration. The current popularity of deep learning means, to the surprise of no one, that recent Offline RL papers learn policies parameterized by deeper neural networks and are applied to harder environments. Also, perhaps unsurprisingly, at least one of the authors of (Lange et al., 2012), Martin Riedmiller, is now at DeepMind and appears to be working on … Offline RL.

In the rest of this post, I will summarize my view of the Offline RL literature. From my perspective, it can be roughly split into two categories:

• those which try and constrain the reinforcement learning to consider actions or state-action pairs that are likely to appear in the data.

• those which focus on the dataset, either by maximizing the data diversity or size while using strong off-policy (but not specialized to the offline setting) algorithms, or which propose new benchmark environments.

I will review the first category, followed by the second category, then end with a summary of my thoughts along with links to relevant papers.

As of May 2020, there is a recent survey from Professor Sergey Levine of UC Berkeley, whose group has done significant work in Offline RL. I began drafting this post well before the survey was released but engaged in my bad “leave the draft alone for weeks” habit. Professor Levine chooses a different set of categories, as his papers cover a wider range of topics, so hopefully this post provides an alternative yet useful perspective.

## Off-Policy Deep Reinforcement Learning Without Exploration

(Fujimoto et al., 2019) was my introduction to Offline RL. I have a more extensive blog post which dissects the paper, so I’ll do my best to be concise in this post. The main takeaway is showing that most “off-policy algorithms” in deep RL will fail when solely shown off-policy data due to extrapolation error, where state-action pairs $(s,a)$ outside the data batch can have arbitrarily inaccurate values, which adversely affects algorithms that rely on propagating those values. In the online setting, exploration would be able to correct for such values because one can get ground-truth rewards, but the offline case lacks that luxury.

The proposed algorithm is Batch Constrained deep Q-learning (BCQ). The idea is to run normal Q-learning, but in the maximization step (which is normally $\max_{a’} Q(s’,a’)$), instead of considering the max over all possible actions, we want to only consider actions $a’$ such that $(s’,a’)$ actually appeared in the batch of data. Or, in more realistic cases, eliminate actions which are unlikely to be selected by the behavior policy $\pi_b$ (the policy that generated the static data).

BCQ trains a generative model — a Variational AutoEncoder — to generate actions that are likely to be from the batch, and a perturbation model which further perturbs the action. At test-time rollouts, they sample $N$ actions via the generator, perturb each, and pick the action with highest estimated Q-value.

They design experiments as follows, where in all cases there is a behavioral DDPG agent which generates the batch of data for Offline RL:

• Final Buffer: train the behavioral agent for 1 million steps with high exploration, and pool all the logged data into a replay buffer. Train a new DDPG agent from scratch, only on that replay buffer with no exploration. Since the behavioral agent will have been learning along those 1 million steps, there should be high “state coverage.”

• Concurrent: as the behavioral agent learns, train a new DDPG agent concurrently (hence the name) on the behavioral DDPG replay buffer data. Again, there is no exploration for the new DDPG agent. The two agents should have identical replay buffers throughout learning.

• Imitation Learning: train the behavioral agent until it is sufficiently good, then run it for 1 million steps (potentially with more noise to increase state coverage) to get the replay buffer. The difference with “final buffer” is that the 1 million steps are all from the same policy, whereas the final buffer was throughout 1 million steps, which may have resulted in many, many gradient updates depending on the gradient-to-env-steps hyper-parameter.

The biggest surprise is that even in the concurrent setting, the new DDPG agent fails to learn well! To be clear: the agents start at the beginning with identical replay buffers, and the offline agent draws minibatches directly from the online agent’s buffer. I can only think of a handful of differences in the training process: (1) the randomness in the initial policy and (2) noise in minibatch sampling. Am I missing anything? Those factors should not be significant enough to lead to divergent performance. In contrast, BCQ is far more effective at learning offline from the given batch of DDPG data.

When reading papers, I often find myself wondering about the relationship between algorithms in batches (pun intended) of related papers. Conveniently, there is a NeurIPS 2019 workshop paper where Fujimoto benchmarks algorithms. Let’s turn to that.

## Benchmarking Batch Deep Reinforcement Learning Algorithms

This solid NeurIPS 2019 workshop paper, by the same author of the BCQ paper, makes a compelling case for the need to evaluate Batch RL algorithms under unified settings. Some research, such as his own, shows that commonly-used off policy DeepRL algorithms fail to learn in an offline fashion, whereas (Agarwal et al., 2020) counter this, but with the caveat of using a much larger dataset.

One of the nice things about the paper is that it surveys some of the algorithms researchers have used for Batch RL, including Quantile Regression DQN (QR-DQN), Random Ensemble Mixture (REM), Batch Constrained Deep Q-Learning (BCQ), Bootstrapping Error Accumulation Reduction Q-Learning (BEAR-QL), KL-Control, and Safe Policy Improvement with Baseline Bootstrapping DQN (SPIBB-DQN). All these algorithms are specialized for the Batch RL setting with the exception of QR-DQN, which is a strong off-policy algorithm shown to work well in an offline setting.

Now, what’s the new algorithm that Fujimoto proposes? It’s a discrete version of BCQ. The algorithm is delightfully straightforward:

My “TL;DR”: train a behavior cloning network to predict actions of the behavior policy based on its states. For the Q-function update on iteration $k$, change the maximization over the successor state actions to only consider actions satisfying a threshold:

$\mathcal{L}(\theta) = \ell_k \left(r + \gamma \cdot \Bigg( \max_{a' \; \mbox{s.t.} \; \frac{G_\omega(a'|s')}{\max \hat{a} \; G_\omega(\hat{a}|s')} > \tau} Q_{\theta'}(s',a') \Bigg) - Q_\theta(s,a) \right)$

When executing the policy during test-time rollouts, we can use a similar threshold:

$\pi(s) = \operatorname*{argmax}_{a \; \mbox{s.t.} \; \frac{G_\omega(a'|s')}{\max \hat{a} \; G_\omega(\hat{a}|s')} > \tau} Q_\theta(s,a)$

Note the contrast where normally in Q-learning, we’d just do the max or argmax over the entire set of valid actions. Therefore, we will end up ignoring some actions that potentially have high Q-values, but that’s fine (and desirable!) if those actions have vastly over-estimated Q-values.

• The parallels are obvious between $G_\omega$ in continuous versus discrete BCQ. In the continuous case, it is necessary to develop a generative model which may be complex to train. In the discrete case, it’s much simpler: run behavior cloning!

• I was confused about why BCQ does the behavior cloning update of $\omega$ inside the for loop, rather than beforehand. Since the data is fixed, this seems suboptimal since the optimization for $\theta$ will rely on an inaccurate model $G_\omega$ during the first few iterations. After contacting Fujimoto, he agreed that it is probably better to move the optimization before the loop, but his results were not significantly better.

• There is a $\tau$ parameter we can vary. What happens when $\tau = 0$? Then it’s simple: standard Q-learning, because any action should have non-zero probability from the generative model. Now, what about $\tau=1$? In practice, this is exactly behavior cloning, because when the policy selects actions it will only consider the action with highest $G_\omega$ value, regardless of its Q-value. The actual Q-learning portion of BCQ is therefore completely unnecessary since we ignore the Q-network!

• According to the appendix, they use $\tau = 0.3$.

There are no theoretical results here; the paper is strictly experimental. The experiments are on nine Atari games. The batch of data is generated from a partially trained DQN agent over 10M steps (50M steps is standard). Note the critical design choice of whether:

• we take a single fixed snapshot (i.e., a stationary policy) and roll it out to get steps, or
• we take logged data from an agent during its training run (i.e., a non-stationary policy).

Fujimoto implements the first case, arguing that it is more realistic, but I think that claim is highly debatable. Since the policy is fixed, Fujimoto injects noise by setting $\epsilon=0.2$ 80% of the time, and setting $\epsilon=0.001$ otherwise. This must be done on a per-episode basis — it doesn’t make sense to change epsilons within an episode!

What are some conclusions from the paper?

• Discrete BCQ seems to be the best of the “batch RL” algorithms tested. But the curves look really weird: BCQ performance shoots up to be at or slightly above the noise-free policy, but then stagnates! I should also add: exceeding the underlying noise-free policy is nice, but the caveat is that it’s from a partially trained DQN, which is a low bar.

• For the “standard” off-policy algorithms of DQN, QR-DQN, and REM, QR-DQN is the winner, but still under-performs a noisy behavior policy, which is unsatisfactory. Regardless, trying QR-DQN in an offline setting, even though it’s not specialized for that case, might be a good idea if the dataset is large enough.

• Results confirm some results from (Agarwal et al., 2020) in that distributional RL aids in exploitation), but that the success they were observing is highly specific to settings Agarwal used: a full 50M history of a teacher’s replay buffer, with a changing snapshot, plus noise from sticky actions.

Here’s a summary of results in their own words:

Although BCQ has the strongest performance, on most games it only matches the performance of the online DQN, which is the underlying noise-free behavioral policy. These results suggest BCQ achieves something closer to robust imitation, rather than true batch reinforcement learning when there is limited exploratory data.

This brings me to one of my questions (or aspirations, if you put it that way). Is it possible to run offline RL, and reliably exceed the noise-free behavior policy? That would be a dream scenario indeed.

## Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

This NeurIPS 2019 paper is highly related to Fujimoto’s BCQ paper covered earlier, in that it also focuses on an algorithm to constrain the distribution of actions considered when running Q-learning in a pure off-policy fashion. It identifies a concept known as bootstrapping error which is clearly described in the abstract alone:

We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it.

I immediately thought: what’s the difference between bootstrapping error here versus extrapolation error from (Fujimoto et al., 2019)? Both terms can be used to refer to the same problem of propagating inaccurate Q-values during Q-learning. However, extrapolation error is a broader problem that appears in supervised learning contexts, whereas bootstrapping is specific to reinforcement learning algorithms that rely on bootstrapped estimates.

The authors have an excellent BAIR Blog post which I highly recommend because it provides great intuition on how bootstrapping error affects offline Q-learning on static datasets. For example, this figure below shows that in the second plot, we may have actions $a$ that are outside the distribution of actions (OOD is short for out-of-distribution) induced by the behavior policy $\beta(a|s)$, indicated with the dashed line. Unfortunately, if those actions have $Q(s,a)$ values that are much higher, then they are used in the bootstrapping process for Q-learning to form the targets for Q-learning updates.

Incorrectly high Q-values for OOD actions may be used for backups, leading to accumulation of error. Figure and caption credit: Aviral Kumar.

They also have results showing that if one runs a standard off-the-shelf off-policy (not offline) RL algorithm, that simply increasing the size of the static dataset does not appear to mitigate performance issues – which suggests the need for further study.

The main contributions of their paper are: (a) theoretical analysis that carefully constraining the actions considered during Q-learning can mitigate error propagation, and (b) a resulting practical algorithm known as “Bootstrapping Error Accumulation Reduction” (BEAR). (I am pretty sure that “BEAR” is meant to be a spin on “BAIR,” which is short for Berkeley Artificial Intelligence Research.)

The BEAR algorithm is visualized below. The intuition is to ensure that the learned policy matches the support of the action distribution from the static data. In contrast, an algorithm such as BCQ focuses on distribution matching (center). This distinction is actually pretty powerful; only requiring a support match is a much weaker assumption, which enables Offline RL to more flexibly consider a wider range of actions so long as the batch of data has used those actions at some point with non-negligible probability.

Illustration of support constraint (BEAR) (right) and distribution-matching constraint (middle). Figure and caption credit: Aviral Kumar.

To enforce this in practice, BEAR uses what’s known as the Maximum Mean Discrepancy (MMD) distance between actions from the unknown behavior policy $\beta$ and the actor $\pi$. This can be estimated directly from samples. Putting everything together, their policy improvement step for actor-critic algorithms is succinctly represented by Equation 1 from the paper:

$\pi_\phi := \max_{\pi \in \Delta_{|S|}} \mathbb{E}_{s \sim \mathcal{D}} \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ \min_{j=1,\ldots,K} \hat{Q}_j(s, a)\right] \quad \mbox{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}} \Big[ \text{MMD}(\mathcal{D}(\cdot|s), \pi(\cdot|s)) \Big] \leq \varepsilon$

The notation is described in the paper, but just to clarify: $\mathcal{D}$ represents the static data of transitions collected by behavioral policy $\beta$, and the $j$ subscripts are from the ensemble of Q-functions used to compute a conservative estimate of Q-values. This is the less interesting aspect of the policy update as compared to the MMD constraint; in fact the BAIR Blog post doesn’t include the ensemble in the policy update. As far as I can tell, there is no ablation study that tests just using one or two Q-networks, so I wonder which of the two is more important: the ensemble of networks, or the MMD constraint?

The most closely related algorithm to BEAR is the previously-discussed BCQ (Fujimoto et al., 2019). How do they compare? The BEAR authors (Kumar et al., 2019) claim:

• Their theory shows convergence properties under weaker assumptions, and they are able to bound the suboptimality of their approach.

• BCQ is generally better when off-policy data is collected by an expert, but BEAR is better when data is collected by a weaker (or even random) policy. They claim this is because BCQ too aggressively constrains the distribution of actions, and this matches the interpretation of BCQ as matching the distribution of the policy of the data batch, whereas BEAR focuses on only matching the action support.

Upon reading this, I became curious to see if there’s a way to combine the strengths of both of the algorithms. I am also not entirely convinced that MuJoCo is the best way to evaluate these algorithms, so we should hopefully look at what other datasets might appear in the future so that we can perform more extensive comparisons of BEAR and BCQ.

At this point, we now consider papers that are in the second category – those which, rather than constrain actions in some way, focus on investigating what happens with a large and diverse dataset while maximizing the exploitation capacity of standard off-policy Deep RL algorithms.

## An Optimistic Perspective on Offline Reinforcement Learning

Unlike the prior papers, which present algorithms to constrain the set of considered actions, this paper argues that it is not necessary to use a specialized Offline RL algorithm. Instead, use a stronger off-policy Deep RL algorithm with better exploitation capabilities. I especially enjoyed reading this paper, since it gave me insights on off-policy reinforcement learning, and the experiments are also clean and easy to understand. Surprisingly, it was rejected from ICLR 2020, and I’m a little concerned about how a paper with this many convincing experimental results can get rejected. The reviewers also asked why we should care about Offline RL, and the authors gave a rather convincing response! (Fortunately, the paper eventually found a home at ICML 2020.)

Here is a quick summary of the paper’s experiments and contributions. When discussing the paper or referring to figures, I am referencing the second version on arXiv, which corresponds to the ICLR 2020 submission and used “Batch RL” instead of “Offline RL” so we’ll use both terms interchangeably. The paper was previously titled “Striving for Simplicity in Off-Policy Deep Reinforcement Learning.”

• To form the batch for Offline RL, they use logged data from 50M steps of standard online DQN training. In general, one step is four environment frames, so this matches the 200M frame case which is standard for Atari benchmarks. I believe the community has settled on the 1 step to 4 frame ratio. As discussed in (Machado et al., 2018), to introduce stochasticity, the agents employ sticky actions. So, given this logged data, let’s run Batch RL, where we run off-policy deep Q-learning algorithms with a 50M-sized replay buffer, and sample items uniformly.

• They show that the off-policy, distributional-based DeepRL algorithms Categorical DQN (i.e., C51) and Quantile Regression DQN (i.e., QR-DQN), when trained solely on that logged data (i.e., in an offline setting), actually outperform online DQN!! See Figure 2 in the paper, for example. Be careful about what this claims means: C51 and QR-DQN are already known to be better than vanilla DQN, but the experiments show that even in the absence of exploration for those two methods, they still out-perform online (i.e., with exploration) DQN.

• Incidentally, offline C51 and offline QR-DQN also out-perform offline DQN, which as expected, is usually worse than online DQN. (To be fair, Figure 2 suggests that in 10-15 out of 60 games, offline DQN can actually outperform the online variant.) Since the experiments disentangle exploration from exploitation, we can explain the difference between performance of offline DQN versus offline C51 or QR-DQN as due to exploitation capability.

• Thus so far we have the following algorithms, from worst to best with respect to game score: offline DQN, online DQN, offline C51, and offline QR-DQN. They did not present a full result of offline C51 except for a few games in the Appendix but I’m assuming that QR-DQN would be better in both offline and online cases. In addition, I also assume that online C51 and online QR-DQN would outperform their offline variants, at least if their offline variants are trained on DQN-generated data.

• To add further evidence that improving the base off-policy Deep RL algorithm can work well in the Batch RL setting, their results in Figure 4 suggest that using Adam as the optimizer instead of RMSprop for DQN is by itself enough to get performance gains. In that this offline DQN can even outperform online DQN on average! I’m not sure how much I can believe this result, because Adam can’t offer that much of an improvement, right?

• They also experiment with a continuous control variant, using 1M samples from a logged training run of DDPG. They apply Batch-Constrained Q-learning from (Fujimoto et al., 2019) as discussed above, and find that it performs reasonably well. But they also find that they can simply use Twin-Delayed DDPG (i.e., TD3) from (Fujimoto et al., 2018) (yes, the same guy!) and train normally in an off-policy fashion to get better results than offline DDPG. Since TD3 is known as a stronger off-policy continuous control deep Q-learning algorithm than DDPG, this further bolsters the paper’s claims that all we need is a stronger off-policy algorithm for effective Batch RL.

• Finally, from the above observations, they propose their Random Ensemble Mixture (REM) algorithm, which uses an ensemble of Q-networks and enforces Bellman consistency among random convex combinations. This is similar to how Dropout works. There are offline and online versions of it. In the offline setting, REM outperforms C51 and QR-DQN despite being simpler. By “simpler” the authors mainly refer to not needing to estimate a full distribution of the value function for a given state, as distributional methods do.

That’s not all they did. In an older version of the paper, they also tried experiments with logged data from a training run of QR-DQN. However, the lead author told me he removed those results since there were too many experiments which were confusing readers. In addition, for logged data from training QR-DQN, it is necessary to train an even stronger off-policy Deep RL algorithm to out-perform the online QR-DQN algorithm. I have to admit, sometimes I was also losing track of all the experiments being run in this paper.

Here is a handy visualization of some algorithms involved in the paper: DQN, QR-DQN, Ensemble-DQN (their baseline) and REM (their algorithm):

My biggest takeaway from reading this paper is that in Offline RL, the quality of the data matters significantly, and it is better to use data from many different policies rather than one fixed policy. That they get logged data from a training run means that, literally, every four steps, there was a gradient update to the policy parameters and thus a change to the policy itself. This induces great diversity in the data for Offline RL. Indeed, (Fujimoto et al., 2019) argues that the success of REM and off-policy algorithms more generally depends on the training data composition. Thus, it is not generally correct to think of these papers contradicting each other; they are more accurately thought of as different ways to achieve the same goal. Perhaps the better way going forward is simply to use larger and larger datasets with strong off-policy algorithms, while also perhaps specializing those off-policy algorithms for the batch setting.

## IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

This paper proposes the algorithm IRIS: Implicit Reinforcement without Interaction at Scale. It is specialized for offline learning from large-scale robotics datasets, where the demonstrations may be either suboptimal or highly multi-modal. The algorithm is motivated by the same off-policy, Batch RL considerations as other papers I discuss here, and I found this paper because it cited a bunch of them. Their algorithm is visualized below:

To summarize:

• IRIS splits control into “high-level” and “low-level” controllers. The high-level mechanism, at a given state $s_t$, must pick a new goal state $s_g$. Then, the low-level mechanism is conditioned on that goal state, and produces the actual actions $a \sim \pi_{im} (s_t | s_g)$ to take.

• The high-level policy is split in two parts. The first samples several goal proposals. The second picks the best goal proposal to pass to the low-level controller.

• The low-level controller, given the goal $s_g$, takes $T$ actions conditioned on that goal. Then, it returns control to the high level policy, which re-samples the goal state.

• The episode terminates when the agent gets sufficiently close to the true goal state. This is a continuous state domain, so they simply pick a distance threshold to the state. They are also in the sparse reward domain, adding another challenge.

How are the components trained?

• The first part of the high-level controller uses a goal conditional Variational AutoEncoder (cVAE). Given a sequence of states in the data, IRIS samples pairs that are $T$ time steps apart, i.e., $(s_t, s_{t+T})$. The encoder $E(s_{t},s_{t+T})$ maps the tuple to a set of latent variables for a Gaussian, i.e., $\mu, \sigma =E(s_{t},s_{t+T})$. The decoder must construct the future state: $\hat{s}_{t+T} \sim D(s_t, z)$ where $z$ is a Gaussian sampled from $\mu$ and $\sigma$. This is for training; for test time, they sample $z$ from a standard normal $z \sim \mathcal{N}(0,1)$ (with regularization during training) and pass it to the decoder, so that it produces goal states.

• The second part uses an action cVAE as part of their simpler variant of Batch Constrained Deep Q-learning (discussed at the beginning of this blog post) for the value function in the high-level controller. This cVAE, rather than predicting goals, will predict actions conditioned on a state. This can be trained by sampling state-action pairs $(s_t,a_t)$ and having the cVAE predict $a_t$. They can then use it in their BCQ algorithm because the cVAE will model actions that are more likely to be part of the training data.

• The low-level controller is a recurrent neural network that, given $s_t$ and $s_g$, produces $a_t$. It is trained with behavior cloning, and therefore does not use Batch RL. But, how does one get the goal? It’s simple: since IRIS assumes the low-level controller runs for a fixed number of steps (i.e., $T$ steps) then they take consecutive state-action sequences of length $T$ and then treat the last state as the goal. Intuitively, the low-level controller trained this way will be able to figure out how to get from a start state to a “goal” state in $T$ steps, where “goal” is in quotes because it is not a true environment goal but one which we artificially set for training. This reminds me of Hindsight Experience Replay, which I have previously dissected.

Some other considerations:

• They argue that IRIS is able to handle diverse solutions because the goal cVAE can sample different goals, to explicitly take diversity into account. Meanwhile, the low-level controller only has to model short-horizon goals at a time “resolution” that does not easily permit many solutions.

• They argue that IRIS can handle off-policy data because their BCQ will limit actions to those likely to be generated by the data, and hence the value function (which is used to select the goal) will be more accurate.

• They split IRIS into higher and lower level controllers because in theory this may help to handle for suboptimal demonstrations — the high-level controller can pick high value goals, and the low-level controller just has to get from point A to point B. This is also pretty much why people like hierarchies in general.

Their use of Batch RL is interesting. Rather than using it to train a policy, they are only using it to train a value function. Thus, this application can be viewed as similar to papers that are concerned with off-policy RL but only for the purpose of evaluating states. Also, why do they argue their variant of BCQ is simpler? I think it is because they eschew from training a perturbation model, which was used to optimally perturb the actions that are used for candidates. They also don’t seem to use a twin critic.

They evaluate IRIS on three datasets. Two use their prior work, RoboTurk. You can see an overview on the Stanford AI Blog here. I have not used RoboTurk before so it may be hard for me to interpret their results.

• Graph Reach: they use a simple 2D navigation example, which is somewhat artificial but allows for easy testing of multi-modal and suboptimal examples. Navigation tasks are also present in other papers that test for suboptimal demonstrations, such as SAVED from Ken Goldberg’s lab.

• Robosuite Lift: this involves the Robosuite Lift data, where a single human performed teleoperation (in simulation) using RoboTurk, to lift an object. The human intentionally used suboptiomal demonstrations..

• RoboTurk Can Pick and Place: now they use a pick-and-place task, this time using RoboTurk to get a diverse set of samples due to using different human operators. You can see an overview on the Stanford AI Blog here. Again, I have not used Roboturk, but it appears that this is the most “natural” of the environments tested.

Their experiments benchmark against BCQ, which is a reasonable baseline.

Overall, I think this paper has a mix of both the “action constraining” algorithms discussed in this blog post, and the “learning from large scale datasets” papers. It was the first to show that offline RL could be used as part of the process for robot manipulation. Another project that did something similar, this time with physical robots, is from DeepMind, to which we now turn.

## Scaling Data-driven Robotics with Reward Sketching and Batch Reinforcement Learning

This recent DeepMind paper is the third one I discuss which highlights the benefits of a large, massive offline dataset (which they call “NeverEnding Storage”) coupled with a strong off-policy reinforcement learning algorithm. It shows what is possible when combining ideas from reinforcement learning, human-computer interaction, and database systems. The approach consists of five major steps, as nicely indicated by the figure:

In more detail, they are:

1. Get demonstrations. This can be from a variety of sources: human teleoperation, scripted policies, or trained policies. At first, the data is from human demonstrations or scripted policies. But, as robots continue to train and perform tasks, their own trajectories are added to the NeverEnding Storage. Incidentally, this paper considers the multi-task setup, so the policies act on a variety of tasks, each of which has its own starting conditions, particular reward, etc.

2. Reward sketching. A subset of the data points are selected for humans to indicate rewards. Since it involves human intervention, and because reward design is fiendishly difficult, this part must be done with care, and certainly cannot be done by having humans slowly and manually assign a number to every frame. (I nearly vomit when simply thinking about doing that.) The authors cleverly engineered a GUI where a human can literally sketch a reward, hence the name reward sketching, to seamlessly get rewards (between 0 and 1) for each frame.

3. Learning the reward. The system trains a reward function neural network $r_\psi$ to predict task-specific (dense) rewards from frames (i.e., images). Rather than regress directly on the sketched values, the proposed approach involves taking two frames $x_t$ and $x_q$ within the same episode, and enforcing consistency conditions with the reward functions via hinge losses. Clever! When the reward function updates, this can trigger retroactive re-labeling of rewards per time step in the NES.

4. Batch RL. A specialized Batch RL algorithm is not necessary because of the massive diversity of the offline dataset, though they do seem to train task-specific policies. They use a version of D4PG, short for “Distributed Distributional Deep Deterministic Policy Gradients” which is … a really good off-policy RL algorithm! Since the NES contains data from many tasks, if they are trying to optimize the learned reward for a task, they will draw 75% of the minibatch from all of the NES, and draw the remaining 25% from task-specific episodes. I instantly made the connection to DeepMind’s “Observe and Look Further (arXiv 2018)” paper (see my blog post here) which implements a 75-25 minibatch ratio among demonstrator and agent samples.

5. Evaluation. Periodically evaluate the robot and add new demonstrations to NES. Their experiments consist of a Sawyer robot facing a 35 x 35 cm basket of objects, and the tasks generally involve grasping objects or stacking blocks.

6. Go back to step (1) and repeat, resulting in over 400 hours of video data.

There is human-in-the-loop involved, but they argue (reasonably, I would add) that reward sketching is a relatively simple way of incorporating humans. Furthermore, while human demonstrations are necessary, those are ideally drawn from existing datasets.

They say they will release their dataset so that it can facilitate development of subsequent Batch RL algorithms, though my impression is that we might as well deploy D4PG, so I am not sure if this will spur more Batch RL algorithms. On a related note, if you are like me and have trouble following all of the “D”s in the algorithm and all of DeepMind’s “state of the art” reinforcement learning algorithms, DeepMind has a March 31 blog post summarizing the progression of algorithms on Atari. I wish we had something similar for continuous control, though.

Here are some comparisons between this and the ones from (Agarwal et al., 2020) and (Mandlekar et al., 2020) previously discussed:

• All papers deal with Batch RL from a large set of robotics-related data, though the datasets themselves differ: Atari versus RoboTurk versus this new dataset, which will hopefully be publicly available. This paper appears to be the only one capable of training Batch RL policies to perform well on new tasks. The analogue for Atari would be training a Batch RL agent on several games, and then applying it (or fine-tuning it) to a new Atari game, but I don’t think this has been done.

• This paper agrees with the conclusions of (Agarwal et al., 2020) that having a sufficiently large and diverse dataset is critical to the success of Offline RL.

• This paper uses D4PG as a very powerful, offline RL algorithm for learning policies, whereas (Agarwal et al., 2020) proposes a simpler version of Quantile-Regression DQN for discrete control, and (Mandlekar et al., 2020) only use Batch RL to train a value function instead of a policy.

• This paper proposes the novel reward sketching idea, whereas (Agarwal et al., 2020) only use environments that give dense rewards, and (Mandlekar et al., 2020) use environments with sparse rewards that indicate task success.

• This paper does not factorize policies into lower and higher level controllers, unlike (Mandkelar et al., 2020), though I assume in principle it is possible to merge the ideas.

In addition to the above comparisons, I am curious about the relationship between this paper and RoboNet from CoRL 2019. It seems like both projects are motivated by developing large datasets for robotics research, though the latter may be more specialized to visual foresight methods, but take my judgment with a grain of salt.

Overall, I have hope that, with disk space getting cheaper and cheaper, we will eventually have robots deployed in fleets that can draw upon this storage in some way.

# Concluding Remarks and References

What are some of the common themes or thoughts I had when reading these and related papers? Here are a few:

• When reading these papers, take careful note as to whether the data is generated from a non-stationary or a stationary policy. Furthermore, how diverse is the dataset?

• The “data diversity” and “action constraining” aspects of this literature may be complementary, but I am not sure if anyone has shown how well those two mix.

• As I mention in my blog posts, it is essential to figure out ways that an imitator can outperform the expert. While this has been demonstrated with algorithms that combine RL and IL with exploration, the Offline RL setting imposes extra constraints. If RL is indeed powerful enough, maybe it is still able to outperform the demonstrator in this setting. Thus, when developing algorithms for Offline RL, merely meeting the demonstrator behavior is not sufficient.

Happy offline reinforcement learning!

Here is a full listing of the papers covered in this blog post, in order of when I introduced the paper.

Finally, here are another set of Offline RL or related references that I didn’t have time to cover, but I will likely modify this post in the future, especially given that I already have summary notes to myself on most of these papers (but they are not yet polished enough to post on this blog).

There is also extensive literature on off-policy evaluation, without necessarily focusing on policy optimization or deploying learned policies in practice. I did not focus on these as much since I wanted to discuss work that trains policies in this post.

I hope this post was helpful! As always, thank you for reading.

# Getting Started with Blender for Robotics

Blender is a popular open-source computer graphics software toolkit. Most of its users probably use it for its animation capabilities, and it’s often compared to commercial animation software such as Autodesk Maya and Autodesk 3ds Max. Over the last one and a half years, I have used Blender’s animation capabilities for my ongoing robotics and artificial intelligence research. With Blender, I can programmatically generate many simulated images which then form the training dataset for deep neural network robotic policies. Since implementing domain randomization is simple in Blender, I can additionally perform Sim-to-Real transfer. In this blog post, and hopefully several more, I hope to demonstrate how to get started with Blender, and more broadly to make the case for Blender in AI research.

As of today’s writing, the latest version is Blender 2.83, which one can download from its website for Windows, Mac, or Linux. I use the Mac version on my laptop for local tests and the Linux version for large-scale experiments on servers. When watching older videos of Blender or borrowing related code, be aware that there was a significant jump between Blender 2.79 and Blender 2.80. By comparison, the gap between versions 2.80 to 2.83 is minor.

Installing Blender is usually straightforward. On Linux systems, I use wget to grab the file online from the list of releases here. Suppose one wants to use version 2.82a, which is the one I use these days. Simply scroll to the appropriate release, right-click the desired file, and copy the link. I then paste it after wget and run the command:

wget https://download.blender.org/release/Blender2.82/blender-2.82a-linux64.tar.xz


This should result in a *.tar.xz file, which for me was 129M. Next, run:

tar xvf blender-2.82a-linux64.tar.xz


The v is optional and is just for verbosity. To check the installation, cd into the resulting Blender directory and type ./blender --version. In practice, I recommend setting an alias in the ~/.bashrc like this:

export PATH=${HOME}/blender-2.82a-linux64:$PATH


which assumes I un-tarred it in my home directory. The process for installing on a Mac is similar. This way, when typing in blender, the software will open up and produce this viewer:

The starting cube shown above is standard in default Blender scenes. There’s a lot to process here, and there’s a temptation to check out all the icons to see all the options available. I recommend resisting this temptation because there’s way too much information. I personally got started with Blender by watching this set of official YouTube video tutorials. (The vast majority have automatic captions that work well enough, but a few strangely have captions in different languages despite how the audio is clearly in English.) I believe these are endorsed by the developers, or even provided by them, which attests to the quality of its maintainers and/or community. The quality of the videos is outstanding: they cover just enough detail, provide all the keystrokes used to help users reproduce the setup, and show common errors.

For my use case, one of the most important parts of Blender is its scripting capability. Blender is tightly intertwined with Python, in the sense that I can create a Python script and run it, and Blender will run through the steps in the script as if I had performed the equivalent manual clicks in the viewer. Let’s see a brief example of how this works in action, because over the course of my research, I often have found myself adding things manually in Blender’s viewer, then fetching the corresponding Python commands to be used for scripting later.

Let’s suppose we want to create a cloth that starts above the cube and falls on it. We can do this manually based on this excellent tutorial on cloth simulation. Inside Blender, I manually created a “plane” object, moved it above the cube, and sub-divided it by 15 to create a grid. Then, I added the cloth modifier. The result looks like this:

But how do we reproduce this example in a script? To do that, look at the Scripting tab, and the lower left corner window in it. This will show some of the Python commands (you’ll probably need to zoom in):

Unfortunately, there’s not always a perfect correspondence of the commands here and the commands that one has to actually put in a script to reproduce the scene. Usually there are commands missing from the Scripting tab that I need to include in my actual scripts in order to get them working properly. Conversely, some of the commands in the Scripting tab are irrelevant. I have yet to figure out a hard and fast rule, and rely on a combination of the Scripting tab, borrowing from older working scripts, and Googling stuff with “Blender Python” in my search commands.

From the above, I then created the following basic script:

If this Python file is called test-cloth.py then running blender -P test-cloth.py will reproduce the setup. Clicking the “play” button at the bottom results in the following after 28 frames:

Nice, is it? The cloth is “blocky” here, but there are modifiers that can and will make it smoother.

The Python command does not need to be done in a “virtualenv” because Blender uses its own Python. Please see this Blender StackExchange post for further details.

There’s obviously far more to Blender scripting, and I am only able to scratch the surface in this post. To give an idea of its capabilities, I have used Blender for the following three papers:

The first two papers used Blender 2.79, whereas the third used Blender 2.80. The first two used Blender solely for generating (domain-randomized) images from cloth meshes imported from external software, whereas the third created cloth directly in Blender and used the software’s simulator.

In subsequent posts, I hope to focus more on Python scripting and the cloth simulator in Blender. I also want to review Blender’s strengths and weaknesses. For example, there are good reasons why the first two papers above did not use Blender’s built-in cloth simulator.

I hope this served as a concise introduction to Blender. As always, thank you for reading.

# Early Summer Update

Hello everyone! Here’s a quick early summer update. I had the last few days off from research since it’s the end of the semester and a few days before I begin my remote summer internship at Google Brain. During my time off, I added a new photo album on Flickr based on my trip to Vietnam for the International Symposium on Robotics Research (ISRR) in October 2019. The album has almost 200 photos from my iPhone. I also made minor updates to my older blog posts about ISRR, which you can access in the archives, to include some featured photos.

I wanted to finish the album because going to Vietnam was one of my last major trips before the COVID-19 pandemic, and it’s one that I especially cherish among my entire travel history, because it brought me to a place I knew little about beyond reading books and news about the tragic Vietnam War. That’s one of the benefits of travel. It opens our eyes to new areas and cultures.

I also updated my earlier photo albums for some of the other conferences I attended. First, I only recently realized that my photos were private. Whoops! They should be visible now judging from my tests logging out of Flickr and checking the albums. Second, I used the Flickr “Organizr” edit setting to rearrange photos from some earlier albums to get them in order based on when I actually took the photos on my iPhone. For the ISRR 2019 album, the photos are already in order since I figured out a better way to upload photos. On my laptop, I open the Photos app, group all the photos in an album within Photos (not to be confused with an album in Flickr), and then click “File –> Export –> Export Photos.” This will make a copy of the photos on the local file system in my laptop. From there, I use Flickr’s upload feature, and order the photos alphabetically, which fortunately means the photos are in order since they are named based on numbers.

I have several other actionable items on my agenda, but admittedly these may have to be pushed back by many months. One is to improve this website design. As explained here, the blog has looked like this for over five years, and I want to experiment with changes to make the website more visually appealing. The problem is backwards compatibility: I’d need the website changes to be able to retain all my LaTeX, all my code formatting, and inevitably this means re-reading over 300 posts from the last nine years. Let me know if you have any suggestions in that regard.

As always, thanks for reading this blog. In addition, I hope you are safe, and are able to stay indoors as much as possible if you have the privilege of doing so. I hope that life will return to normal in the near future.

# My Third Berkeley AI Research Blog Post

Hello everyone! My silence on this blog is because I was hard at work last month writing for another blog, the Berkeley AI Research (BAIR) Blog. Today, my collaborators and I just released a new post which describes our work in robotics and deformable object manipulation. As I’ve done with my past two BAIR Blog posts (here and here), I will mention a few words about it.

Our post is unusual in that it features papers from two different labs that didn’t formally collaborate on them. We feature four research papers in the post, two from Professor Pieter Abbeel’s lab and two from Professor Ken Goldberg’s lab. In case you’re wondering, no, we were not aware that we were working independently on these projects. I vividly remember submitting my fabric smoothing paper to arXiv back in September … and then, a few days later, seeing Lerrel Pinto (soon to be on the faculty at NYU) present us with results that were essentially what I had just showed in my paper! To be clear, it was a pleasant surprise, not an unwanted one. The more people working on the topic, the better.

Despite the focus on similar robotics tasks, the machine learning techniques we used were different. In fact, there’s an elegant, hierarchical way of categorizing our collective work. At the top, we have model-free versus model-based methods. They are further sub-divided into imitation learning versus reinforcement learning (for model-free methods) and image-space versus latent-space (for model-based methods). This neat split in our work fortunately made it easy for us to not only write this blog post – in the sense that the organization was clear from the start — but also to convey to the reader that there is no one way to approach a robotics problem. In fact, I would argue that the sign of being a true expert in one’s field is understanding the tradeoffs among various techniques that could, in theory, solve a certain problem.

I hope this post is an effective high-level introduction to the many ways we can approach robot learning problems.

In sum, here are the three BAIR Blog posts that I have written (comments are welcome):

All my posts took significant effort to write. I know I probably spend too much time blogging compared to what I “should” be doing as a typical PhD student, but I enjoy it too much to give it up. I plan to write at least one more blog post before graduation. At that point hopefully someone will magically appear out of thin air to take over the BAIR Blog maintenance duties from me …

As an extra bit of bonus information for reading my personal blog, here are some behind-the-scenes statistics about the BAIR Blog. First, let’s look at the number of subscribers:

Here, I show the growth in subscribers from May 2019 to April 2020. (We started the blog in July 2017.) At the time I took the screenshot, we have 5,878 subscribers. Of these, for any given email to subscribers to notify them of a new BAIR Blog post, about 41.0% will open the email, and then a further 6.8% of them will actually click on the link that we provide to the blog post. Not bad! I definitely think each BAIR blog post gets more attention than the average research paper.

Oh, and we have 536 subscribers that, for whatever reason, subscribed and then unsubscribed. What gives?!?

Now let’s switch over to page views, courtesy of Google Analytics. Here’s what I see when I list the countries of origin of our visitors, from the BAIR Blog’s entire history.

The United States is the clear leader here, with India and China the next two countries. If anything I’m surprised that the gap between the United States and India (or China) is that large. I think that Indian or Chinese citizens who access the blog while located in the United States get counted as a United States user. I’ll have to check how Google Analytics actually works here, but this seems to be the most logical conclusion.

The rest of the list also isn’t that surprising. Singapore and Hong Kong are showing, despite being the size of cities, that they have a large set of Artificial Intelligence enthusiasts.

In terms of demographics, the BAIR Blog audience is estimated to be about 85% male, 15% female, as shown below. I know, we’re trying to work on this. (I frequently email BAIR students and postdocs requesting for blog posts, and I do this slightly more towards females to at least balance out the authorship.)

Here’s what happens when I look at the most popular blog posts and the page views from the beginning of the blog:

The most popular blog post by far is Chelsea Finn’s post about Model Agnostic Meta Learning (MAML), the wildly popular meta-learning algorithm for enabling deep neural networks to rapidly adapt to new tasks. Incidentally, that algorithm was a key reason why Finn landed a faculty position at Stanford. Most of the other popular posts are about (deep) reinforcement learning, which continues to be a Berkeley specialty. My first two blog posts are somewhat farther down the list, with about 10,000 page views for each. That’s still a respectable amount of views.

Well, I hope that was an interesting behind-the-scenes look at the BAIR Blog. Say, I should probably contact the maintainers of the Stanford AI Blog and the CMU Machine Learning Blog to see how much we’re dominating them in terms of subscriber count and page views …

# Fully Convolutional Neural Networks for Fast and Reliable Robotic Manipulation

The figure above, from the TossingBot paper with caption included, shows an example of how to use fully convolutional neural networks for robotic manipulation.

Given the COVID-19 situation and the “shelter-in-place” order in the Bay Area, I have been working remotely the last few weeks. The silver lining is that, because I recently wrapped up a bunch of projects, I was already planning to use my Spring Break (which was last week) for brainstorming new research projects, which is more suitable for remote work, and I am fortunate that my job affords that opportunity. Part of the brainstorming process involves plowing through research papers on my never-ending “TODO” list. So while working at home during a pandemic has not been as good for me as it was for Sir Isaac Newton back then, it has not been terrible. I was able to read through three papers (and re-read one paper) about robotic manipulation using fully convolutional neural networks.

In particular, this blog post will discuss these four recent robotics papers, which I abbreviate as follows (see the bottom of the post for a full set of citations):

I already dissected Form2Fit in a prior blog post but I will revisit the paper as it is highly related to the first three. This blog post will compare and contrast the techniques used in these four papers.

The papers specialize in image-based robotic manipulation, where decisions are made on the basis of dense, per-pixel calculations. We call these “dense” operations because they compute something for every pixel in an input image. For an example of a concept that involves dense operations, see my recent blog post on dense object descriptors.

In order to efficiently perform dense per-pixel operations, the authors employ Fully Convolutional Neural Networks (FCNs). For a refresher on these, you can read the massively influential CVPR 2015 paper or perhaps look at resources such as Stanford’s CS 231n class. While FCNs were originally developed for semantic segmentation tasks, the papers I discuss here show how FCNs can be used for robotic manipulation.

Well, what are these papers about, and how do they use FCNs?

First, the Pick-and-Place paper focuses on picking out cluttered items from a bin. Their system employs several FCNs (as we’ll see, using several streams is common) to map from an image of a workspace (i.e., a bin of objects) to a value between 0 and 1, which is called an “affordance.” Numbers closer to 1 are better. Affordances should not be interpreted as a probability, even though I often think of them that way, because the training labels are not determined by measuring a probability of success, but by a relative scale labeled by human users. There are four action primitives: two for suctioning, and two for grasping, and the exact type used is not learned but hard-coded via surface normals (for suctioning) or location near a bin edge (for grasping). To handle grasp rotations, the authors simply discretize rotation into 16 groups by cleverly rotating the input RGBD images, and then passing all the images in parallel through the FCN. Interestingly, the Appendix reports other modeling architectures, such as $n$ separate FCNs, but that was sample inefficient and also challenging to load in GPU memory. While this isn’t the focus of my blog post, they interestingly do a pick first, then recognize framework, rather than the reverse which is probably more common. So, their robot picks the grasped object, and runs a separate neural network to recognize it. The predicted image class then tells the robot where to stow the object.

Second, the Pushing and Grasping Synergies paper investigates how to simultaneously learn pushing and grasping actions to pick items from a workspace and put them in an external bin. The reason for learning pushing (and not just grasping) is that they consider a workspace with objects situated next to each other, so that pushing first, then grasping, to isolate objects is often a better strategy than grasping alone. The system uses model-free deep Q-learning to train two FCNs, one for pushing and the other for grasping, and training is entirely self supervised: the authors cleverly set a system so that the robot can dump a box of objects on the workspace, and then tries pushing and grasping actions. Eventually, it trains the two Q-networks well enough that they can be deployed in scenarios with novel objects. The paper says just 5.5 hours of real-world data training is needed.

Third, the TossingBot paper investigates how to train a robot to throw arbitrary objects into target bins. Why do this, beyond generating cool videos? Throwing increases the range of the robot’s reachability and it may increase picking speed. The paper explores the synergy between grasps and throws, and jointly learns the two primitives so that the robot performs grasps that enable good throws. (It reminds me of the synergy between pushing and placing from the prior paper!) The throwing part uses the idea of residual physics. It learns a velocity magnitude conditioned on visual information, and then adds that to the output from an analytical physics model. That physics model helps to generalize to different target bins, and provides a reasonable initial velocity estimate. The estimate is then corrected from the learned model, because it is hard to model the forces of aerodynamic drag. The results and videos are truly impressive.

Fourth, the Form2Fit paper focuses on assembling kits together with robots. While my prior blog post covers this in detail, to summarize here again, robotic kit assembly is done with a sequence of picking and placing actions. The “picking” uses a suctioning action, and we need a good suctioning action as a pre-requisite to getting good placing actions. Both picking and placing are represented as FCNs. However, there is a third module, called a match network (also a FCN) which uses descriptors to indicate correspondence. Why? To associate a suction location on the object to a placing location, and to change the orientation. As I implied in my prior post, imagine we didn’t have the matching network. What would happen? Initially, given a grasped object, there are many ways we can place it successfully. But eventually we have to be able to assemble the entire kit, so each object must be inserted just at the right spot, and not just anywhere with high probability, so that subsequent actions can correctly fill up the kit.

So, to recap, here’s the desired output of the FCNs, assuming that they have been sufficiently trained:

• Pick-and-Place: affordance (not probability) values for suctioning and grasping action primitives. Affordance values are bounded within $[0,1]$, and higher numbers are better.

• Grasping and Pushing Synergies: $Q(s,a)$ values, or the discounted sum of future rewards at this given image $s$ and taking action primitive $a$, under the robot’s target (not behavioral) policy.

• TossingBot: the output of the grasping network is the probability of “grasping success” when grasping at any particular pixel. Be aware that the training signal depends on the subsequent throwing success. The throwing network, interestingly, outputs the desired velocity residual which is added to an initial velocity estimate from an analytical physics model.

• Form2Fit: the output of the suction and place networks are the probabilities of the respective actions. The output of the match network is the dense object descriptor representation, which is of the same height and width as the input image but with a higher dimension, as they used $d=64$ channels. This is used to indicate correspondence among the suction and place actions.

In order to make those FCNs output desired values, we need to train them. How does the process of collecting labels and training work for each method?

• Pick-and-Place: skilled and experienced human users must manually label the affordances. Thus, this is the only paper among the four here that does not employ automatic data labeling via trial and error. The human manually labels pixels as positive, negative, or neither, and then pixels with neither are trained with a loss value of 0 via backpropagation. The authors had to design an interface to make this feasible, and it has to be sparsely labeled to make this practical. The training data consists of fewer than 2000 of these manually labeled images, though this is surely before data augmentation. Interestingly, 2000 is roughly on the order of how many images I had for our bed-making paper.

• Grasping and Pushing Synergies: the labels are implicit through reinforcement learning rewards. Their reward design is simple: a $+1$ for a successful grasp and a $+0.5$ for a successful push that “meaningfully changes the scene” — the latter requires a hard-coded threshold. Through model-free reinforcement learning and backpropagation, the FCNs updates the parameters such that their output computes the learned value function.

• TossingBot: the robot collects data through trial-and-error, and the videos show how the system is set to be self-supervised to keep human intervention at a minimum. The grasping network is trained with throwing success, not grasping success. This is critical because the whole point of grasping is to enable good throws! Therefore, when I say “grasping success probability” it really should be interpreted as “probability that this grasp will be successful for a subsequent throw.” They automatically get this label by checking if the grasped image landed in the target box. For the throw, we first get the analytical estimate $\|\hat{v}_{x,y}\|$ from physics equations conditioned on a known target spot. Then, we get the actual landing spot from overhead cameras, which I assume are similar to the ones for detecting throwing success, and can deduce the true residual from that.

• Form2Fit: the data collection here is a bit subtle, and covered in depth in my prior blog post. It’s clever and involves reversing the task, i.e., disassembly. It is easier to disassemble than assemble, and by doing this, the robot gets data points for training the picking and placing modules, and then training the dense object net to get the match module. Once again for a grasp point or suction point, we take a single pixel (actually, a radius around it) and then backpropagate through it.

Now that we have the FCNs, what actions should the robot take at each time step? This is generally straightforward once the FCNs have done their heavy duty task in getting per-pixel image numbers:

• Pick-and-Place: given all the possible action primitives along with all the rotations, pick the single pixel with the highest affordance value, and execute that action. This involves a maximum operation over every single image output from the FCN (including a factor of 16x for rotations), and then a second maximum over pixels in them. That’s the idea, but in practice they employed some heuristics. One is “suction first then grasp,” which led them to artificially scale the suctioning affordance values. Another one is that if the robot repeatedly tries an action but does not affect the scene — a problem I’ve experienced in several research projects — then they decrease the affordances of the relevant pixels. It’s these little things that, though somewhat hacky, help maximize performance.

• Grasping and Pushing Synergies: the action chosen is one that maximizes the Q-values. In other words, take the maximum over all the 32 possible images (16 for grasping, 16 for pushing) over all pixels within those images. That’s a lot to consider, but the computation is parallelized.

• TossingBot: the pixel with the highest grasp probability (from the output of the grasping module) across all orientations is chosen for the grasp point. Then, the robot will toss using the corresponding predicted velocity, which is provided in the same pixel location and same orientation in the output image of the throwing module.

• Form2Fit: the planner first samples a set of potential actions. It then uses the descriptors to see which pick-and-place pair has the lowest L2 distance in descriptor space, and chooses that action. This “minimize distance in descriptor space” is standard for many of the robotics and descriptors papers I read nowadays. It can be expensive to sample and evaluate so many actions, so it is necessary to tune the sampling frequency.

Overall, what do these papers suggest as the advantages of the FCN-based approach?

• The technique is object-agnostic in that it does not make any assumptions about the kind of objects the robot might grasp.

• FCNs are efficient for per-pixel calculations, and this is helpful when we want a label for every pixel in an input image. In addition, the resulting action is often a simple function of the FCN output, such as taking an “argmax” across the pixels, as mentioned earlier. Some other alternatives for data-driven robotic grasping, as covered in an earlier blog post, require sampling a set of image patches or running the Cross Entropy Method.

• Their specific architectural choice of rotating the input image by 16, to represent 16 different rotations, means they do not need to consider rotation as part of the action, simplifying the primitive. In addition, by keeping the different rotations in one architecture, rather than splitting into 16 different networks or 16 different trunks, they can use weight sharing to improve generalization and training efficiency.

• Since the output is of the same dimension of the input with per-pixel properties, one can debug and/or interpret the output by looking at a heat map to see which values are higher.

There is other work that uses FCNs for efficient grasping, such as one that came right out of our own AUTOLAB and was presented at ICRA 2019. That paper, interestingly, trained a Convolutional Neural Network and then converted it to a Fully Convolutional Neural Network, to avoid the manual labeling done in the Robotic Pick-and-Place paper.

If you are interested in learning how to accelerate training of affordance-based policies with FCNs, I refer you to an ICRA 2020 paper which argues for the benefits of visual pre-training based on passive data without robotic interaction. This means the subsequent fine-tuning on active data from interaction is significantly shorter.

Overall it seems like FCNs are a powerful ingredient in the machine learning and robotics toolbox, and can be combined with techniques such as reinforcement learning, dense object descriptors, self-supervision, and other techniques.

Here are the full citations of the papers I discussed:

Thank you for reading, and stay safe.

# My Interview with PyImageSearch's Sayak Paul

I’m pleased to share that my interview with Sayak Paul, who works at PyImageSearch, is now available to read over at his Medium blog. Here’s how he introduces me:

A warm welcome to Daniel Seita for today’s interview. Daniel is a computer science Ph.D. student at the University of California, Berkeley. His research interests broadly lie in areas like Artificial Intelligence, Robotics, and Deep Learning. He is deeply passionate about explaining technical insights and one such favorite insight of mine from Daniel’s archive is Understanding Generative Adversarial Networks. You can check out all of his blog pieces from here. He writes on a wide range of topics and has written more than 300 such pieces.

I was approach by Paul with a cold email, and agreed to do the interview for a number of reasons:

• I am honored that my blog posts have provided him insights.
• I was impressed by the wide range of inspiring people who Paul previously interviewed.
• I wanted to indirectly provide more support to PyImageSearch because that website has been a tremendously helpful resource for my research over the last few years.

To expand on the last point, PyImageSearch is incredible, filled with tutorial after tutorial in such plain-spoken, clear language. I typically use it as a reference on using OpenCV to adjust or annotate images, but PyImageSearch is also helpful for Deep Learning more broadly. For example, literally yesterday, I was learning how to write code using TensorFlow 2.0 with the new eager execution (I usually use PyTorch). As part of my learning process, I read the PyImageSearch articles on keras versus tf.keras and how to use the new tf.GradientTape feature. I have not had to pay anything to read these awesome resources, though I would be willing to do so.

As I mentioned earlier, I hope you enjoy the interview. Inspired by the interview, I am working hard on blog posts here, to be released in the next few months. It’s Spring Break week now, and unlike last year when I was a teaching assistant for Berkeley’s Deep Learning class and needed to use Spring Break to catch up on research and other things, this time I’m mostly taking a breather from an intense research semester thus far.

# Thoughts After Using rlpyt For Several Months

Over the past few months, I have frequently used the open-source reinforcement learning library rlpyt, to the point where it’s now one of the primary code bases in my research repertoire. There is a BAIR Blog post which nicely describes the rationale for rlpyt, along with its features.

Before rlpyt, my primary reinforcement learning library was OpenAI’s baselines. My switch from baselines to rlpyt was motivated by several factors. The primary one is that baselines is no longer actively maintained. I argued in an earlier blog post that it was one of OpenAI’s best resources, but I respect OpenAI’s decision to prioritize other resources, and if anything, baselines may have helped spur the development of subsequent reinforcement learning libraries. In addition, I wanted to switch to a reinforcement learning library that supported more recent algorithms such as distributional Deep Q-Networks, coupled with perhaps higher quality code with better documentation.

Aside from baselines and rlpyt, I have some experience with stable-baselines, which is a strictly superior version of baselines, but I also wanted to switch from TensorFlow to PyTorch, hence why I did not gravitate to stable-baselines. I have very limited experience with the first major open-source DeepRL library, rllab, which also came out of Berkeley, though I never used it for research as I got on the bandwagon relatively late. I also used John Schulman’s modular_rl library when I was trying to figure out how to implement Trust Region Policy Optimization. More recently, I have explored rlkit for its Twin-Delayed DDPG implementation, along with SpinningUp to see cleaner code implementations.

I know there are a slew of other DeepRL libraries, such as Intel’s NervanaSystems coach which I would like to try due to its huge variety of algorithms. There are also reinforcement learning libraries for distributed systems, but I prefer to run code on one machine to avoid complicating things.

Hence, rlpyt it is!

# Installation and Quick Usage

To install rlpyt, observe that the repository already provides conda environment configuration files, which will bundle up the most important packages for you. This is not a virtualenv, though it has the same functional effect in practice. I believe conda environments and virtualenvs are the two main ways to get an isolated bundle of python packages.

On the machines I use, I find it easiest to first install miniconda. This can be done remotely by downloading via wget and running bash on it:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
// after installing ...
. ~/.bashrc  // to get conda commands to work
// to ensure (base) is not loaded by default
conda config --set auto_activate_base false
. ~/.bashrc  // to remove the (base) env


In the above, I set it so that conda does not automatically activate its “base” environment for myself. I like having a clean, non-environment setup by default on Ubuntu systems. In addition, during the bash command above, the Miniconda installer will ask this:

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes


I answer “yes” so that it gets initialized.

After the above, I clone the repository and then install with this command:

This will automatically make a new conda environment, specialized for Linux with CUDA 10 for the command above. Then, finally, don’t forget:

to make rlpyt a package you can import within your conda environment, and to ensure that any chances you make in rlpyt will be propagated throughout your environment without having to do another pip install.

For quick usage, I follow the rlpyt README and use the examples directory. There are several scripts in there that can be run easily.

# Possible Workflow

There are several possible workflows one can follow when using rlpyt. For running experiments, you can use scripts that mirror those in the examples directory. Alternatively, for perhaps more heavy-duty experiments, you can look at what’s in rlpyt/experiments. This contains configuration, launch, and run scripts, which provide utility methods for testing a wide variety of hyperparameters. Since that requires me to dive through three scripts that are nested deep into rlpyt’s code repository, I personally don’t follow that workflow; instead I just take a script in the examples directory and build upon it to handle more complex cases.

Here’s another thing I find useful. As I note later, rlpyt can use more CPU resources than expected. Therefore, particularly with machines I share with other researchers, I limit the number of CPUs that my scripts can “see.” I do this with taskset. For example, suppose I am using a server with 32 CPUs. I can run a script like this:

and this will limit the script to using CPUs indexed from 21 to 31. On htop, this will be CPUs numbered 22 through 32, as it’s one-indexed there.

With this in mind, here is my rough workflow for heavy-duty experiments:

• Double check the machine to ensure that there are enough resources available. For example, if nvidia-smi shows that the GPU usage is near 100% for all GPUs, then I’m either not going to run code, or I will send a Slack message to my collaborators politely inquiring when the machine will free up.

• Enter a GNU screen via typing in screen.

• Run conda activate rlpyt to activate the conda environment.

• Set export CUDA_VISIBLE_DEVICES=x to limit the experiment to the desired GPU.

• Run the script with taskset as described earlier.

• Spend a few seconds afterwards checking that the script is running correctly.

There are variations to the above, such as using tmux instead of screen, but hopefully this general workflow makes sense for most researchers.

For plotting I don’t use the built-in plotter from rlpyt (which is really from another code base). I keep the progress.csv file and download it in a stand-alone python script for plotting. I also don’t use TensorBoard. In fact, I still have never used TensorBoard to this day. Yikes!

# Understanding Steps, Iterations, and Parallelism

When using rlpyt, I think one of the most important things to understand is how the parallelism works. Due to parallelism, interpreting the number of “steps” an algorithm runs requires some care. In rlpyt, the code frequently refers to an itr variable. One itr should be interpreted as “one data collection AND optimization phase”, which is repeated for however many itrs we desire. After some number of itrs have passed, rlpyt logs the data by reporting it to the command line and saving the textual form in a debug.log file.

The data collection phase uses parallel environments. Often in the code, a “Sampler” class (which could be Serial-, CPU-, or GPU-based) will be defined like this:

(The examples folder in the code base will show how the samplers are used.)

What’s important for our purposes is batch_T and batch_B. The batch_T defines the number of steps taken in each parallel environment, while batch_B is the number of parallel environments. Thus, in DeepMind’s DQN Nature paper, they set batch_B=1 (i.e., it was serial) with batch_T=4 to get 4 steps of new data, then train, then 4 new steps of data, etc. rlpyt will enforce a similar “replay ratio” so that if we end up with more parallel environments, such as batch_B=10, it performs more gradient updates in the optimization phase. For example, a single itr could consist of the following scenarios:

• batch_T, batch_B = 4, 1: get 4 new samples in the replay buffer, then 1 gradient update.
• batch_T, batch_B = 4, 10: get 40 new samples in the replay buffer, then 10 gradient updates.

The cumulative environment steps, which is CumSteps in the logger, is thus batch_T * batch_B, multiplied by the number of itrs thus far.

In order to define how long the algorithm runs, one needs to specify the n_steps argument to a runner, usually MinibatchRl or MinibatchEval (depending on whether evaluation should be online or offline), as follows:

Then, based on n_steps, the maximum number of itrs is determined from that. Modulo some rounding issues, this is n_steps / (batch_T * batch_B).

In addition, we use log_interval_steps to represent the itr interval when we log data.

# Current Issues

I have been very happy with rlpyt. Nonetheless, as with any major open-source code produced by a single PhD student (named Adam), there are bound to be some little issues that pop up here and there. Throughout the last few months, I have posted five issue reports:

• CPU Usage. This describes some of the nuances regarding how rlpyt uses CPU resources on a machine. I posted it because I was seeing some discrepancies between my intended CPU allocation versus the actual CPU allocation, as judged from htop. From this issue report, I started prefacing all my python scripts with taskset -c x-y where x and y represent CPU indices.

• Using Atari Game Scores. I was wondering why the performance of my DQN benchmarks were substantially lower than those I saw in DeepMind’s papers, and the reason was due to reporting clipped scores (i.e., bounding values within $[-1,1]$) versus the game scores. From this issue report, I added in AtariTrajInfo as the “trajectory information” class in my Atari-related scripts, because papers usually report the game score. Fortunately, this change has since been updated to the master branch.

• Repeat Action Probability in Atari. Another nuance with the Atari environments is that they are deterministic, in the sense that taking an action will lead to only one possible next state. As this paper argues, using sticky actions helps to introduce stochasticity into the Atari environments while requiring minimal outside changes. Unfortunately, rlpyt does not enable it by default because it was benchmarking against results that did not use sticky frames. For my own usage, I keep the sticky frames on with probability $p=0.25$ and I encourage others to do the same.

• Epsilon Greedy for CPU Sampling (bug!). This one, which is an actual bug, has to do with the epsilon schedule for epsilon greedy agents, as used in DQN. With the CPU sampler (but not the Serial or GPU variants) the epsilon was not decayed appropriately. Fortunately, this has been fixed in the latest version of rlpyt.

• Loading a Replay Buffer. I thought this would be a nice feature. What if we want to resume training for an off-policy reinforcement learning algorithm with a replay buffer? It’s not sufficient to save the policy and optimizer parameters, as in an on-policy algorithm such as Proximal Policy Optimization, because we need to reproduce the exact contents of the replay buffer at the point when we saved the training state.

Incidentally, notice how these issue reports are designed so that they are easy for others to reproduce. I have argued previously that we need sufficiently detailed issue reports for them to be useful.

There are other issue reports that I did not create, but which I have commented on, such as this one about saving snapshots, that I hope are helpful.

Fortunately, Adam has been very responsive and proactive, which increases the usability of this code base for research. If researchers from Berkeley all gravitate to rlpyt, then it provides additional benefits for using rlpyt, since we can assist each other.

# The Future

I am happy with using rlpyt for research and development. Hopefully it will be among the last major reinforcement learning libraries I need to pick up for my research. There is always some setup cost to using a code base, but I feel like that threshold has passed for me and that I am at the “frontier” of rlpyt.

Finally, thank you Adam, for all your efforts. Let’s go forth and do some great research.

# More On Dense Object Nets and Descriptors: Applications to Rope Manipulation and Kit Assembly

In a prior blog post, I reviewed two papers about dense object descriptors in the context of robotic manipulation. The first paper, at CoRL (Florence et al., 2018), introduced it for object manipulation and open-loop grasping policies. The second paper, to appear at RA-Letters and ICRA (Florence et al., 2020), used descriptors and correspondence for policy optimization. In this post, I will discuss how descriptors can be used for two different robotics applications: rope manipulation and kit assembly. We can additionally combine descriptors with other tools in robotics such as imitation learning and self-supervision, which these papers demonstrate.

Before reading this post, I highly recommend going through the 30-minute PyTorch tutorial associated with the CoRL 2018 paper. I did not know anything about descriptors before reading the CoRL 2018 paper last year, and I appreciate the efforts of the authors to help us quickly learn the relevant concepts.

As a quick refresher on terminology, I refer to dense object nets as the networks which have descriptors as their output. They are “dense” because they involve predicting something at every pixel of an image. Don’t worry, this is not done by iterating through each target pixel (my brain hurts just thinking about doing that) but by passing the full image through the net and getting all the labels on each pixel in parallel.

## Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data

There is a whole sub-field of robotics that deals with rope manipulation. This paper, which recently came out of our lab at UC Berkeley, applies dense object descriptors for rope manipulation. They show, among other things, that descriptors can be applied to highly deformable objects. Previously, (Florence et al., 2018) applied it on slightly deformable objects, such as hats and shoes.

Another interesting aspect of this paper is that the authors train dense object nets in simulation. This provides perfect information of rope, so given two images of the same rope in different configurations, it should be possible to provide exact correspondences among pixels of the ropes. The paper argues that because rope is highly deformable, it is not sufficient to just change the pose of the camera to learn object descriptors, as was done in the earlier CoRL 2018 paper which used multiple camera views. I believe the CoRL paper needed to get multiple camera views for their full 3D reconstruction of the objects under consideration.

Blender is the simulator used in the paper. I know Blender reasonably well as we have recently used it for fabric manipulation (Seita et al., 2019). The below image shows a visualization of the simulator used in the work (left two columns).

The third image shows a simulated depth image of the rope, where pixels are a height value from an overhead camera. The fourth image shows that we can define an ordering of points on the rope, where points close to the ball are closer to yellow, and the colors change as one “traverses” away from the rope. A couple of pointers:

• The simulator produces depth images, which may help in sim-to-real transfer since depth is naturally invariant to colors. We have been using depth for a lot of our papers, as we show in our 2018 BAIR Blog post. In addition to standard domain randomization techniques, the authors perform several tricks on the images of rope to make it look similar to the noisier depth images we encounter in practice.
• Regarding the color ordering on the rope, the goal in training a dense object net is to generate descriptors such that if we translate the descriptor values into pixels, we get a consistent color ordering among the same rope but in different configurations. All that matters is the relative ordering of colors. We don’t care if the descriptor network happens to “decide” that points closer to the ball are blue instead of yellow, so long as that “decision” is consistent among different images.
• There is a ball attached to one end of the rope, which is needed to enforce a notion of ordering among the pixels. Otherwise, there would be two possible orderings, which might fool a descriptor net. Indeed, the ablation studies show that this ball is perhaps the most important hyperparameter decision the authors made.

That was the simulator. We have to use it to get data to train the dense object network. The authors do this by sampling to get some rope state $\xi_1$. Then, they apply a random transformation to get $\xi_2$. This is essentially a robot’s action, defined as a pick and place transform. The pair is then used as a training data, where the goal is to train the dense object net to make corresponding points in $\xi_1$ and $\xi_2$ to be close to each other, while encouraging non-corresponding points to be further apart. The training loss is done in the same manner as in the CoRL 2018 paper so please read that paper for the exact loss function, which I also dissect in my prior post.

Here is a visualization of what descriptors learn:

The first and third images show synthetic depth images of the rope in different configurations, and the second and fourth show visualizations of the corresponding dense object net outputs. Again, don’t get too caught up by the exact colors; all that matters is that they are consistent across the two images, and indeed they are! The process of generating these color images usually involves normalization techniques such as scaling the pixel values to be within $[0,255]$. In this paper, the descriptor dimension is 3, which makes it easy to visualize images.

You will also see that intersections and occlusions can be tricky with descriptors, since it may be impossible to get truly exact correlations; they would be restricted to pixels appearing at the uppermost layer of the object(s). The paper measures the uncertainty of descriptor nets and reports that, as expected, uncertainty is highest at intersections and occlusions.

The learned descriptors above are interesting, but now how do we use them in practice for robot manipulation? We need some benefit from descriptors, otherwise why we would use them? The paper reports two sets of experiments:

• One-Shot Visual Imitation. No, don’t get confused with my post of a similar title, that was meta-learning, and here there is no meta-learning. The terminology means the robot is provided only one demonstration of a task to complete, where the demonstration is a sequence of images of rope states. The goal is to sequentially take actions to reach each of the images, or “sub-goals” if you prefer, in order. This is the same problem setting as in (Nair et al., 2017) – just think of it as requiring a demonstration at test time.

The policy is a greedy action: it uses descriptors from the current and (sub)goal images. From these, they sample paired points on the rope. They then look at the descriptor values, and find which pairing of sampled points is furthest from each other, and take a pick-and-place action to correct that. Intuitively, doing this each time gets the rope closer to the goal state because the greedy action has handled the most “distant” set of points. Assuming that actions do not cause any other descriptor pairs to increase in distance (a huge assumption!!) then eventually the rope has to look the same as in the human demonstration images.

• Descriptor Parameterized Knot Tying. This is more specific for knot-tying, and uses a two-action sequence tuned towards a specific knot type. Thus, for another kind of knot they’d have to redefine the trajectory (and assume we already know how to do it) but there is no free lunch. They fix the actions for one rope, but here’s the clever part: they record the action vectors, but then “translate” that into descriptor space by passing it through the dense object net. This is what they mean by “defining an action in terms of descriptors.” Then, for a new goal image, since we already have the descriptors, we can use the original descriptor and map it into the corresponding pixels for the new goal image. We get the complete action by doing this for the pick and the place components. Thus, the action is generalizable across images.

For both experiments, they use a YuMi robot. For the former, they try and get the YuMi to manipulate the rope so it reaches some target, which they can measure with Intersection over Union (IoU). For the latter, they perform 50 knot-trying trials and report 66% success rates, out-performing prior work, but the caveat of course is that the experimental setup is not the same. I encourage you to visit the project website to see some videos.

There are also a set of simulated experiments that show extensive ablations over various perturbations of parameters. (If anything, I think there’s too many ablations and not enough focus on the robot experiments, but that’s probably a minor comment given the overall high quality of the paper.) The summary of the results is that descriptor quality, as measured on a held-out test set of images, is insensitive to a variety of parameters, with the exception of including a ball on one end or not. That is perfectly acceptable and reasonable.

To conclude, the advantages of the approach presented in the paper are that it uses depth and simulation to avoid the need for running real robots as in (Nair et al., 2017), and that the descriptors provide correspondence, allowing us to define interpretable, geometric actions. By that, I mean we can take a pixel location of a grasp point on a robot, and use descriptors to map that point to other rope configurations.

## Form2Fit: Learning Shape Priors for Generalizable Assembly from Disassembly

This paper uses descriptors for a very different application: assembling kits together. The first author, Kevin Zakka, already has a nice blog post about the paper, so my post will try and dive more into the technical details.

Kit assembly is deliberately a broad topic, and applies to basically anything that involves packaging something. By using descriptors and machine learning, they can learn picking and placing actions which generalize to assembling other kits not seen in training. They argue that in assembly lines, kits may change every few weeks, motivating learning over hard-coded rules. I can see why Google might have wanted to do this because they might work with companies that have assembly lines.

My first reaction upon understanding the kit assembly task was: great, this is cool, and a problem that I wish I had thought about earlier, but how does one get data on assembling kits? That seems much harder to do in simulation or the real world compared to rope manipulation.

The authors cleverly get data by dis-assembly from complete kits, and then repeating the process in reverse to assemble complete kits, in a manner similar to time-reversal as self-supervision. Even if actions are not truly reversible, such as with a placing operation that displaces existing objects, it seems logical that this helps get more high-quality data since it is intuitively harder to assemble than to dis-assemble. Since the paper does not use simulators, the downside is that a human would have to first provide an assembled kit and then maybe manually assemble things should something go wrong in data collection. As long as this does not need to happen too frequently, then it is acceptable. They report that they need just 500 disassembly sequences, though this is per training kit (to be fair, there are not many training kits). That’s roughly on the order of how many data points I had to physically collect for our bed-making paper from ISRR 2019.

Here is an overview of the pipeline, caption included from the paper:

They use three fully convolutional neural networks in the pipeline. Recall that fully convolutional networks, which were introduced through a monumentally impactful paper from Trevor Darrell’s group a few years ago at Berkeley, are those that use only convolutional layers and efficiently perform dense per-pixel operations by mapping an image of size $(H\times W\times c_1)$ to another one of size $(H\times W\times c_2)$. Thus, all three networks produce per-pixel predictions of something with respect to the input image.

For kit assembly, the action space consists of a pick $p$, a place $q$, and an orientation for placement $\theta$. In addition, $p$ and $q$ are image pixels, which are then converted to a coordinate with respect to the robot’s base frame. The UR5 robot they use applies suctions, which reminds me of Jeff Mahler’s suctioning paper from ICRA 2018.

Interestingly, all three networks use depth images, like the rope manipulation paper above. However, the authors also use grayscale images and concatenate it with the depth images, producing “Grayscale-Depth” images (and not “RGB-Depth” images). I wonder why we don’t see more grayscale since that may reduce the need for heavy color-based domain randomization or additional training data?

The authors split the workspace into two images, one for showing the kit $I_{\rm kit}$ and the other for showing the objects $I_{\rm obj}$ which are initially scattered around and must be assembled in the kit.

Now let’s review the details of the three networks, which are called the suction, placing, and matching modules.

Suction module. For each pixel in $I_{\rm obj}$, this determines the success probability of grasping (i.e., suctioning) something.

• Getting labels is straightforward. The robot can measure the “airflow” of its suction gripper. For a given grasp point pixel $p$, if the airflow shows a success, then from the input image, we must encourage the suction network to assign pixel $p$ as a success. This is only one pixel out of many, so in practice the authors end up artificially increasing a radius about $p$ and labeling those a success. Notice that (a) sometimes we may get failures, so we’d do the same as earlier except assign a failure, and (b) other pixels backpropagate with zero loss. They do NOT assign other pixels as failures, because we don’t know if suctioning at other pixels far from $p$ could indeed lead to picking up something.
• The loss function uses the binary cross entropy loss, i.e., success or failure, for the pixels that were grasped, including those nearby as I mentioned earlier. Interestingly, the authors combine this with a “dice” loss. You can read the technical details in the paper but to summarize I believe it’s used to address class imbalance. For Form2Fit, I think because of the author’s setup, most of the suctions will be a success, and hence training is dominated by “pixel $p$ in a given image is a good pick point” rather than “pixel $p$ in a given image is a bad pick point.”
• Finally, how does the data collection work from the time reversal? It’s pretty clever. First, when we disassemble, at each time we are given an image $I_{\rm kit}^{(t)}$ and apply suctioning on point $p^{(t)}$, where here I add the $t$ superscript to represent time. Notice that this is not the same as what happens during test time, where we must apply suctioning on images of objects, i.e., $I_{\rm obj}$ — but we can think of this as a clever form of data augmentation. Thus, the dis-assembly gives us a sequence of data which includes both picking from observations of the kit and placing where the objects will be during test time:

$\{ (I_{\rm kit}^{(1)}, p^{(1)}), (I_{\rm obs}^{(1)}, q^{(1)}), (I_{\rm kit}^{(2)}, p^{(2)}), (I_{\rm obs}^{(2)}, q^{(2)}), \ldots \}$

Then, during the assembly process, we apply actions in reverse, this time looking at images $I_{\rm obj}^{(t)}$ at each time step, but with the placing action from earlier as the new suctioning action!

Place module. This network figures out a placing pixel into $I_{\rm kit}$, under the assumption that we are suctioning something from the suction network. A key design decision is that they discretize the angle into 20 groups, so there are 20 images passed through the placing network in parallel. Again, this is per-pixel, so for every pixel, there is a value that tells us the probability of placing success. Their deoderant kit example also shows how the placing module implicitly encodes ordering conditioned on the input image. The training process is similar to the suctioning network, with the exception that there isn’t a notion of getting a success signal via measuring something like suction airflow.

• The loss function also uses the cross entropy and a dice loss.
• For every pixel in $I_{\rm kit}$, we need to train the net so that it shows high success for successful places, and low success for failures. To get data, we once again use the time reversal sequence from above. Precisely, the labels are the suction location $p$ at time $t$ and the heightmap $I_{\rm kit}$ at $t+1$. Intuitively this is because if we do the sequence in reverse, we will have $I_{\rm kit}$ as the target with location $p$ as our placing point, i.e., $q$. These are the “success labels” since we assume that the suction step from the disassembly was a success, which seems reasonable since the authors can command the robot to grasp at “reasonable” coordinates on the kit.

Match module. This is the most interesting one to me because it uses descriptors. But first, why do we need this if we already have picking and placing? They argue:

While the suction and placing modules provide a list of candidate picking and placing locations, the system requires a third module to 1) associate each suction location on the object to a corresponding placing location in the kit and 2) infer the change in object orientation. This matching module serves as the core of our algorithm, which learns dense pixel-wise orientation-sensitive correspondences between the objects on the table and their placement locations in the kit.

This makes sense. What would happen if we did not have this network, and only relied on the placing network? It still has a set of 20 rotations as input, so I wonder what happens if we just take the highest probability among all pixels in all 20 images to satisfy (2)? I definitely agree, though, that we need a way to do (1) to get correspondence, because different objects should be placed at different locations.

We have $f: I \in \mathbb{R}^{H\times W\times 2} \to \mathbb{R}^{H\times W \times d}$. In this paper, the descriptor dimension is $d=64$. That is super large compared to the other paper on rope manipulation, and compared to the work from Russ Tedrake’s group. I’m surprised it is that high but I am sure the authors did extensive testing on the descriptor dimension, which they report in the supplementary material. It is a Siamese network with two fully convolutional residual streams, each sharing the same weights (since that’s what “Siamese network” means). The kit image $I_{\rm kit}$ maps to 20 separate descriptors, each of which are 64-dimensional, and one of them is selected to inform the change in rotation via:

$\theta = \frac{360}{20} \times \arg\min_j \| \mu_{\rm kit}^{ij} - \mu_{\rm obj}^i\|_2^2 \quad {\rm for} \;\; j \in \{1,2,\ldots, 20\},\; i \in \{1,2,\ldots, H\times W\}$

The superscript of $j$ means we take one of the 20 descriptor images, so both descriptor images above, $\mu_{\rm kit}^j$ and $\mu_{\rm obj}$, are of dimension $(H\times W\times d)$. Then, we add the superscript of $i$ to represent a single pixel within those images, one of $H\times W$ candidate pixels. This way, we consider the best pixel match among all possible kit-object descriptor images. Finally, the $360/20$ fraction scales the index $j$ appropriately.

Now, how can we train the matching network to encourage similarity in both correspondence between picking and placing, and also the rotation? The loss function itself is the same as the one used in the CoRL 2018 paper, meaning that we need to sample matches and non-matches at the pixel level. The matches are taken from image pairs $(I_{\rm kit}, I_{\rm obj})$ where the kit image must be of the correct rotation (out of 20). Non-matches can be sampled from any of the 20 kit images. Within any pair of images, the pixel correspondences are labeled via object masks, which assumes that the rotation angle $\theta$ can provide us with the label of every pixel in the kit cavity and the corresponding pixel on the object, which is pulled outside the kit through data collection. This should work, particularly because the authors fix the kit to the surface; if that weren’t the case it might be harder to label correspondences.

Once the three networks are trained, the policy comes from the planner. It samples potential actions and then uses descriptors to see which pick-and-place pair (in descriptor space) have the lowest L2 distance, and that’s the action. Like with the rope manipulation paper, the policy is generally simple to describe and involves minimizing some distance in descriptors.

They conduct experiments using a physical UR5 robot, and evaluate by calculating the percentage of times when objects are placed into their target locations. I wonder if this involves some subjective interpretations, because I can imagine (and I see from the videos) that some objects might be almost but not quite inserted. As long as they are consistent with their interpretation, it is probably fine. The experiments show a number of promising results and effectiveness in assembling kits, with generalization to initial conditions of kits, and even to new kits entirely. They wrap up the results with a t-SNE visualization. Overall, I was really impressed with these results. Once again I encourage you to go to the project website for videos for a better understanding.

## Conclusion

Hopefully this gives a readable overview of two different applications of dense object descriptors, showcasing the versatility of the technique. To be concrete, here are the papers I covered in this and my prior post, along with the original ICRA 2017 paper:

Just like combining imitation learning and reinforcement learning or using simulators effectively with self-supervision, I think descriptors for correspondence belong in the toolkit we should use to develop general-purpose robots.

# My PhD Qualifying Exam (Transcript)

To start off my 2020 blogging, here is the much-delayed transcript of my PhD qualifying exam. The qualifying exam is a Berkeley-wide requirement for PhD students, and varies according to the department. You can find EECS-specific details of the exam here, but to summarize, the qualifying exam (or “quals” for short) consists of a 50-60 minute talk to four faculty members who serve on a “quals committee.” They must approve of a student’s quals talk to enable the student to progress to “candidacy.” That’s the point when, contingent on completion of academic requirements, the student can graduate with approval from the PhD advisor. The quals is the second major oral exam milestone in the Berkeley EECS PhD program, the first of which is the prelims. You can find the transcript of my prelims here.

The professors on my qualifying exam committee were John Canny, Ken Goldberg, Sergey Levine, and Masayoshi Tomizuka.

I wrote this transcript right after I took this exam in April of 2018. Nonetheless, I cannot, of course, guarantee the exact accuracy of the words uttered.

## Scheduling and Preparation

During a meeting with Professor Canny in late 2017, when we were discussing my research progress the past semester, I brought up the topic of the qualifying exam. Professor Canny quickly said: “this needs to happen soon.” I resolved to him that it would happen by the end of the spring 2018 semester.

Then, I talked with Professor Goldberg. While seated by our surgical robot, and soon after our ICRA 2018 paper was accepted, I brought up the topic of the quals, and inquired if he would be on my committee. “It would be weird if I wasn’t on the committee” he smiled, giving approval.1 “Will it be on this stuff?” he asked, as he pointed at the surgical robot. I said no, since I was hoping for my talk to be a bit broader than that, but as it turned out, I would spend about 30 percent of my talk on surgical robotics.

Next, I needed to find two more professors to serve on the quals committee. I decided to ask Professor Sergey Levine if he would serve as a member of the committee.

Since Berkeley faculty can be overwhelmed with email, I was advised from other students to meet professors in office hours to ask about quals. I gambled and emailed Professor Levine instead. I introduced myself with a few sentences, and described the sketch of my quals talk to him, and then politely asked if he would serve on the committee.

I got an extremely quick response from Professor Levine, who said he already knew who I was, and that he would be happy to be on the committee. He additionally said it was the “least he could do” because I am the main curator for the BAIR blog, and he was the one who originally wanted the BAIR Blog up and running.

A ha! There’s a lesson here: if you want external faculty to serve on a committee, make sure you help curate a blog they like.

Now came the really hard part: the fourth committee member. To make matters worse, there is (in my opinion) an unnecessary rule that states that one has to have a committee member outside of EECS. At the time of my exam, I barely knew any non-EECS professors with the expertise to comment on my research area.

I scrolled through a list of faculty, and decided to try asking Professor Masayoshi Tomizuka from the Mechanical Engineering department. In part, I chose him because I wanted to emphasize that I was moving in a robotics direction for my PhD thesis work. Before most of my current robotics research, I did a little theoretical machine learning research, which culminated in a UAI 2017 paper. It also helped that his lab is located next to Professor Goldberg’s lab, so I sometimes got a peek at what his students were doing.

I knew there was a zero percent chance that Professor Tomizuka would respond to a cold email, so I went hunting for his office hours.2 Unfortunately, the Mechanical Engineering website had outdated office hours from an earlier semester. In addition, his office door also had outdated office hours.

After several failed attempts at reaching him, I emailed one of his students, who provided me a list of times. I showed up at the first listed time, and saw his office door closed for the duration of the office hours.

This would be more difficult than I thought.

Several days later, I finally managed to see Professor Tomizuka while he was walking to his office with a cup of coffee. He politely allowed me to enter his office, which was overflowing with books and stacks of papers. I don’t know how it’s possible to sift through all of that material. In contrast, when I was at Professor Levine’s office, I saw almost nothing but empty shelves.

Professor Tomizuka, at the time, was a professor at Berkeley for 44 years (!!!) and was still supervising a long list of PhD students. I explained to him about my qualifying exam plan. He asked a few questions, including “what questions do you want me to ask in your exam?” to which I responded that I was hoping he would ask about robot kinematics. Eventually, he agreed to serve on the committee and wrote my name on a post-it note for him to remember.

Success!

Well, not really — I had to schedule the exam, and that’s challenging with busy professors. After several failed attempts at throwing out times, I asked if the professors could provide a full list of their constraints. Surprisingly, both Professor Levine and Professor Tomizuka were able to state their constraints on each day of the week! I’m guessing they had that somewhere on file so that they could copy and paste it easily. From there, it was straightforward to do a few more emails to schedule the exam, which I formally booked about two months in advance.

Success!

All things considered, I think my quals exam scheduling was on the easier side compared to most students. The majority of PhD students probably also have difficulty finding their fourth (or even third) committee members. For example, I know one PhD student who had some extreme difficulty scheduling the quals talk. For further discussion and thoughts, see the end of this post.

I then needed to do my preparation for the exam. I wrote up a set of slides for a talk draft, and pitched them to Professor Canny. After some harsh criticism, I read more papers, did more brainstorming, and re-did my slides, to his approval. Professor Goldberg also generally approved of my slides. I emailed Professor Levine about the general plan, and he was fine with a “40-50 minute talk on prior research and what I want to do.” I emailed Professor Tomizuka but he didn’t respond to my emails, except to one of them a week before to confirm that he would show up to the talk.

I gave two full-length practice talks in lab meetings, one to Professor Goldberg’s lab, and then to Professor Canny’s lab. The first one was hideous, and the second was less hideous. In all, I went through twelve full-length talks talks to get the average below 50 minutes, which I was told is the general upper bound for which students should aim.

Then, at long last, Judgment Day came.

## The Beginning

Qualifying exam date: Tuesday April 24, 2018 at 3:00pm.

Obviously, I showed up way in advance to inspect the room that I had booked for the quals. I checked that my laptop and adapters worked with the slide system set in the room. I tucked in my dress shirt, combed my hair, cleaned my glasses for the tenth time, and stared at a wall.

Eventually, two people showed up: the sign language interpreters. One was familiar to me, since she had done many of my interpreting services in the past. The other was brand new to me. This was somewhat undesirable. Given the technical nature of the topic, I explicitly asked Berkeley’s Disabled Students’ Program to book only interpreters that had worked with me in the past. I provided a list of names more than two weeks in advance of the exam, but it was hard for them to find a second person. It seems like, just as with my prelims, it is difficult to properly schedule sign language interpreting services.

Professor Levine was the first faculty member to show up in the qualifying exam room. He carried with him a folder of my academic materials, because I had designated him as the “chair” of the quals committee (which cannot be one’s advisor). He said hello to me, took a seat, and opened my folder. I was not brave enough to peek into the files about me, and spent the time mentally rehearsing my talk.

Professor Tomizuka was the next to show up. He did not bring any supplies with him. At nearly the same time, Professor Canny showed up, with some food and drink. The three professors quickly introduced each other and shook their hands. All the professors definitely know each other, but I am not sure how well. There might be a generational gap. Professor Levine (at the time) was in his second year as a Berkeley faculty member, while Professor Tomizuka was in his 44th year. They quickly got settled in their seats.

At about 3:03pm, Professor Levine broke the painfully awkward silence: “are we on Berkeley time?”3

Professor Canny [chuckling]: “I don’t think we run those for the qualifying exam …”

Professor Levine [smiling]: “well, if any one professor is on Berkeley time then all the others have to be…”

While I pondered how professors who had served on so many qualifying exam committees in the past had not agreed on a settled rule for “Berkeley-time,” Professor Goldberg marched into the room wearing his trademark suit and tie. (He was the only one wearing a tie.)

“Hey everyone!” he smiled. Now we could start.

Professor Levine: “Well, as the chair of the committee, let’s get started. We’re going to need to talk among ourselves for a bit, so we’ll ask Daniel to step out of the room for a bit while we discuss.”

Gulp. I was already getting paranoid.

The sign language interpreters asked whether they should go out.

Professor Goldberg agreed: “Yeah, you two should probably leave as well.”

As I walked out the room, Professor Goldberg tried to mitigate my concerns. “Don’t worry, this is standard procedure. Be ready in five minutes.”

I was certainly feeling worried. I stood outside, wondering what the professors were plotting. Were they discussing how they would devour me during the talk? Would one of them lead the charge, or would they each take turns doing so?

I stared at a wall while the two sign language interpreters struck up a conversation, and commented in awe about how “Professor Goldberg looks like the typical energetic Berkeley professor.” I wasn’t interested in their conversation and politely declined to join since, well, I had the qualifying exam now!!

Finally, after what seemed like ten minutes — it definitely was not five — Professor Goldberg opened the door and welcomed us back in.

It was time.

## During The Talk

The professors nodded and stared at me. Professor Goldberg was smiling, and sat the closest to me, with notebook and pen in hand.

My talk was structured as follows:

• Part I: introduction and thesis proposal
• Part II: my prior work
• Part III: review of relevant robot learning research
• Part IV: potential future projects

I gave a quick overview of the above outline in a slide, trying to speak clearly. Knowing the serious nature of the talk, I had cut down on my normal humor during my talk preparation. The qualifying exam talk was not the time to gamble on humor, especially since I was not sure how Professor Tomizuka or Professor Levine would react to my jokes.

Things were going smoothly, until I came to my slide about “robot-to-robot teaching.” I was talking in the context of how to “transfer” one robot policy to another robot, a topic that I had previously brainstormed about with both Professor Goldberg and Professor Canny.

Professor Goldberg asked the first question during the talk. “When you say robot-to-robot teaching, why can’t we just copy a program from one robot to another?” he asked.

Fortunately this was a question I had explicitly prepared myself for during my practice talks.4

“Because that’s not teaching, that’s copying a program from one to another, and I’m interested in knowing what happens when we teach. If you think of how humans teach, we can’t just copy our brains and embed them into a student, nor do we write an explicit program of how we think (that would be impossible) and tell the student to follow it. We have to convey the knowledge in a different manner somehow, indirectly.”

Professor Goldberg seemed to be satisfied, so I moved on. Whew, crisis averted.

I moved on, and discussed our surgical robotics work from the ICRA 2018 paper. After rehashing some prior work in calibrating surgical robots, and just as I was about to discuss the details on our procedure, Professor Tomizuka raised his hand. “Wait can you explain why you have cheaper sensors than the prior work?”

I returned to the previous slide. “Prior work used these sophisticated sensors on the gripper which allows for better estimates of position and orientation” I said, pointing at an image which I was now thankful to have included. I provided him with more details on the differences between prior work and our work.

Professor Tomizuka seemed about half-satisfied, but motioned for me to continue with the talk.

I went through the rest of my talk, feeling at ease and making heavy eye contact with the professors, who were equally attentive.

No further interruptions happened.

When I finished the talk, which was right about 50 minutes, I had my customary concluding slide of pictures of my collaborators. “I thank all my collaborators,” I said. I then specifically pointed to the two on the lower right: pictures of Professor Canny and Professor Goldberg. “Especially the two to the lower right, thank you for being very patient with me.” In retrospect, I wish I had made my pictures of them bigger.

“And that’s it,” I said.

The professors nodded. Professor Goldberg seemed like he was trying to applaud, then stopped mid-action. No one else moved.

## Immediately After The Talk

Professor Levine said it was time for additional questions. He started by asking: “I see you’ve talked about two kinds of interactive learning, one with an adversary, one with a teacher. I can see those going two different directions, do you plan to try and do both and then converge later?”

I was a little confused by this question, which seemed open-ended. I responded: “yes there are indeed two ways of thinking of interactive teaching, and I hope to pursue both.” Thinking again at my efforts at implementing code, I said “from my experience, say with Generative Adversarial Networks as an example, it can be somewhat tricky to get adversarial learning to work well, so perhaps to start I will focus on a cooperative teacher, but I do hope to try out both lines of thinking.”

I asked if Professor Levine was satisfied, since I was worried I didn’t answer well enough, and I assumed he was going to ask something more technical. In addition, GANs are fairly easy to implement, particularly with so many open-source implementations nowadays for reference. Surprisingly, Professor Levine nodded in approval. “Any other questions?”

Professor Goldberg had one. “Can you go back to one of the slides you said about student’s performance? The one that said if the student’s performance is conveyed with $P_1$ [which may represent trajectories in an environment] and from that the teacher can determine the student’s weakest skill so that the next set of data $P_2$ from the student shows improvement …””

I flipped back briefly to the appropriate slide. “This one?”

Professor Goldberg: “yes, that one. This sounds interesting, but you can think of a problem where you teach an agent to improve upon a skill, but then that results in a deterioration of another skill. Have you thought about that?”

“Yes, I have,” I said. “There’s actually an interesting parallel in the automated curriculum papers I’ve talked about, where you sample goals further and further away so you can learn how to go from point $A$ to point $B$. The agent may end up forgetting how to go from point $A$ to a point that was sampled earlier in the sequence, so you need to keep a buffer of past goals at lower difficulty levels so that you can continually retrain on those.”

Professor Goldberg: “sounds interesting, do you plan to do that?”

“I think so, of course this will be problem dependent,” I responded, “so I think more generally we just need a way to detect and diagnose these, by repeatedly evaluating the student on those other skills that were taught earlier, and perhaps do something in response. Again problem dependent but the idea of checking other skills definitely applies to these situations.”

Professor Levine asked if anyone had more questions. “John do you have a question?”

“No,” he responded, as he finished up his lunch. I was getting moderately worried.

“OK, well then …” Professor Levine said, “we’d now like Daniel to step outside the room for a second while we discuss among ourselves.”

I walked outside, and both of the interpreters followed me outside. I had two interpreters booked for the talk, but one of them (the guy who was new to me) did not need to do any interpreting at all. Overall, the professors asked substantially fewer questions than I had expected.

## The Result

After what seemed like another 10 minutes of me staring at the same wall I looked at before the talk, the door opened. The professors were smiling.

Professor Levine: “congratulations, you pass!”

All four approached me and shook my hand. Professor Canny and Professor Tomizuka immediately left the room, as I could tell they had other things they wanted to do. I quickly blurted out a “thank you” to Professor Canny for his patience, and to Professor Tomizuka for simply showing up.

Professor Goldberg and Professor Levine stayed slightly longer.

While packing up, Professor Levine commended me. “You really hit upon a lot of the relevant literature in the talk. I think perhaps the only other area we’d recommend more of is the active learning literature.”

Professor Goldberg: “This sounds really interesting, and the three year time plan that you mention for your PhD sounds about right to get a lot done. In fact think of robot origami, John mentioned that. You’ve seen it, right? I show it in all the talks. You can do robot teaching on that.”

“Um, I don’t think I’ve seen it?” I asked.

Professor Goldberg quickly opened up his laptop and showed me a cool video of a surgical robot performing origami. “That’s your PhD dissertation” he pointed.

I nodded, smiling hard. The two professors, and the sign language interpreters, then left the room, and I was there by myself.

Later that day, Professor Levine sent a follow-up email, saying that my presentation reminded him of an older paper. He made some comments about causality, and wondered if there were opportunities to explore that in my research. He concluded by praising my talk and saying it was “rather thought-provoking.”

I was most concerned about what Professor Canny thought of the talk. He was almost in stone-cold silence throughout, and I knew his opinion would matter greatly in how I could construct a research agenda with him in the coming years. I nervously approached Professor Canny when I had my next one-on-one meeting with him, two days after the quals. Did he think the talk was passable?? Did he (gulp) dislike the talk and only passed me out of pity? When I asked him about the talk …

He shrugged nonchalantly. “Oh, I thought it was very good.” And he pointed out, among other things, that I had pleasantly reminded him of another colleague’s work, and that there were many things we could do together.

Wait, seriously?? He actually LIKED the talk?!?!?!?

I don’t know how that worked out. Somehow, it did.

## Retrospective

I’m writing this post more than 1.5 years after I took the actual exam. Now that some time has passed here are some thoughts.

My main one pertains to why we need a non-EECS faculty member. If I have any suggestion for the EECS department, it would be to remove this requirement and to allow the fourth faculty to be in EECS. Or perhaps we can allow faculty who are “cross-listed” in EECS to count as outside members. The faculty expertise in EECS is so broad that it probably is not necessary to reach out to other departments if it does not make sense for a given talk. In addition, we also need to take an honest look as to how much expertise we can glean from someone in a 1.5-hour talk, and if it makes sense to ask for 1.5 hours of that professor’s time when that professor could be doing other, more productive things for his/her own research.

I am fortunate that scheduling was not too difficult for me, and I am thankful to Professor Tomizuka for sitting in my talk. My concern, however, is that some students may have difficulty finding that last qualifying exam member. For example, here’s one story I want to share.

I know an EECS PhD student who had three EECS faculty commit to serving on the quals committee, and needed to find a fourth non-EECS faculty. That student’s advisor suggested several names, but none of the faculty responded in the affirmative. After several months, that student searched for a list of faculty in a non-EECS department.

The student found one faculty who could be of interest, and who I knew served as an outside faculty member on one EECS quals before. After two weeks of effort (due to listed office hours that were inaccurate, just as I experienced), the student was able to confirm to get a fourth member. Unfortunately, this happened right when summer began, and the faculty on the student’s committee were traveling and never in the same place at the same time. Scheduling would have to be put off until the fall.

When summer ended and fall arrived, that student was hoping to schedule the qualifying exam, but was no longer able to contact the fourth non-EECS faculty. After several futile attempts, the student gave up and tried a second non-EECS faculty, and tentatively got confirmation. Unfortunately, once again, the student was not able to contact the faculty member again when it was time to schedule.

It took several more months before the student, with the advisor’s help, was able to find that last, elusive faculty member to serve on the committee.

In all, it took one year for that student to get a quals committee set up! That’s not counting the time that the student would then need to schedule it, which normally has to be done 1 or 2 months in advance.

Again, this is only one anecdote, and one story might not be enough to spur a change in policy, but it raises the question as to why we absolutely need an “outside” faculty member. That student’s research is in a very interesting and important area in EECS, but it’s also an area that isn’t a neat fit for any other department, and it’s understandable that faculty who are not in the student’s area would not want to spend 1.5 hours listening to a talk. There are many professors within EECS that could have served as the fourth faculty, so I would suggest we change the policy.

Moreover, while I don’t know if this is still the current policy, I read somewhere once that students can only file their dissertations at least two semesters after their qualifying exam. Thus, significant delays in getting the quals exam done could delay graduation. Again, I am not sure if this is still the official policy, so I will ask the relevant people in charge.

Let’s move on to some other thoughts. During my quals, the professors didn’t bring a lot of academic material with them, so I am guessing they probably expected me to pass. I did my usual over-preparation, but I don’t think that’s a bad thing. I was also pitching a research direction that (at the time) I had not done research in, but it looks like that is also acceptable for a quals, provided that the talk is of sufficient quality.

I was under a ridiculous amount of stress in the months of February, March, and April (until the quals itself), and I never want to have to go through months like those again. It was an incredible relief to get the quals out of the way.

Finally, let me end with some acknowledgments by thanking the professors again. Thank you very much to the professors who served on the committee. Thank you, Professors John Canny, Ken Goldberg, Sergey Levine, and Masayoshi Tomizuka, for taking the time to listen to my talk, and for your support. I only hope I can live up to your expectations.

1. At the time, I was not formally advised by him. Now, the co-advising is formalized.

2. I felt really bad trying to contact Professor Tomizuka. I don’t understand why we have to ask professors we barely know to spend 1.5 hours of their valuable time on a qualifying exam talk.

3. Classes at UC Berkeley operate on “Berkeley time,” meaning that they start 10 minutes after their official starting time. For example, a class that lists a starting time of 2:30pm starts at 2:40pm in practice.

4. As part of my preparation for the qualifying exam, I had a list of about 50 questions that I felt the faculty would ask.