# Papers That Have Cited Policy Distillation

About a week and a half ago, I carefully read the Policy Distillation paper from DeepMind. The algorithm is easy to understand yet surprisingly effective. The basic idea is to have student and teacher agents (typically parameterized as neural networks) acting in an environment, such as the Atari 2600 games. The teacher is already skilled at the game, but the student isn’t, and needs to learn somehow. Rather than run standard deep reinforcement learning, DeepMind showed that simply running supervised learning, where the student trains its network to match a (tempered) softmax of the teacher’s Q-values, is sufficient to learn how to play an Atari 2600 game. It’s surprising that this works; for one, Q-values do not even form a probability distribution, so it’s not straightforward to conclude that a student trained to match the softmaxes would be able to learn a sequential decision-making task.
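To make this concrete, here is a minimal numpy sketch of the distillation objective for a single state with four discrete actions. The function names, temperature value, and toy Q-values are my own illustrative choices, not taken from the paper:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Tempered softmax; a small tau sharpens the distribution.
    shifted = x / tau - np.max(x / tau)
    e = np.exp(shifted)
    return e / e.sum()

def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL divergence from the student's action distribution to the
    tempered softmax of the teacher's Q-values, for a single state."""
    p = softmax(teacher_q, tau)    # teacher's target distribution
    q = softmax(student_logits)    # student's current distribution
    return np.sum(p * (np.log(p) - np.log(q)))

# Toy example: teacher Q-values over four discrete actions, and an
# untrained student with uniform (all-zero) logits.
teacher_q = np.array([1.0, 0.5, 0.2, -0.3])
student_logits = np.zeros(4)
loss = distillation_loss(teacher_q, student_logits)
```

Minimizing this loss over states sampled from gameplay is the supervised learning step; the paper reports that sharp, low-temperature targets worked best when distilling Q-values.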

Policy Distillation was published at ICLR 2016, and one of the papers that cited it was Born Again Neural Networks (to appear at ICML 2018), a paper which I blogged about recently. The algorithms in these two papers are similar, but they apply to the reinforcement learning (PD) and supervised learning (BAN) domains, respectively.

After reading both papers, I developed the urge to understand all the Policy Distillation follow-up work. Thus, I turned to Google Scholar, one of the greatest research conveniences of modern times; as of this writing, the Policy Distillation paper has 68 citations. (Google Scholar sometimes has a delay in registering certain citations, and it also lists PhD theses and textbooks, so the previous sentence isn’t entirely accurate, but it’s close enough.)

I resolved to understand the main idea of every paper that cited Policy Distillation, and in particular how relevant each citing paper is to the algorithm. I wanted to understand whether papers directly extended the algorithm, or whether they simply cited it as related work, padding DeepMind’s citation count.

I have never done this before for a paper with more than 15 Google Scholar citations, so this was new to me. After spending a week and a half on this, I think I managed to get the gist of Policy Distillation’s “follow-up space.” You can see my notes in this shareable PDF, which I’ve hosted on Dropbox. Feel free to send me recommendations about other papers I should read!

# Born Again Neural Networks

I recently read Born Again Neural Networks (to appear at ICML 2018) and enjoyed the paper. Why? First, the title is cool. Second, it’s related to the broader topics of knowledge distillation and machine teaching that I have been gravitating to lately. The purpose of this blog post is to go over some of the math in Section 3 and discuss its implications, though I’ll assume the reader has a general idea of the BAN algorithm. As a warning, notation is going to be a bit tricky/cumbersome but I will generally match with what the paper uses and supplement it with my preferred notation for clarity.

We have $\mathbf{z}$ and $\mathbf{t}$ representing vectors corresponding to the student and teacher logits, respectively. I’ll try to stick to the convention of boldface meaning vectors, even if they have subscripts, in which case the subscript indexes a sequence of such vectors rather than their components. Hence, we have:

$$\mathbf{z} = (z_1, \ldots, z_n) \quad \text{and} \quad \mathbf{t} = (t_1, \ldots, t_n)$$

or we can also write $\mathbf{z} = \mathbf{z}_k$ if we’re considering a minibatch $\{\mathbf{z}_1, \ldots, \mathbf{z}_b\}$ of these vectors.

Let $\mathbf{x}$ denote input samples (also vectors) and let $Z=\sum_{k=1}^n e^{z_k}$ and $T=\sum_{k=1}^n e^{t_k}$ to simplify the subsequent notation, and consider the cross entropy loss function

$$\mathcal{L}(\mathbf{z}, \mathbf{t}) = -\sum_{i=1}^n \frac{e^{t_i}}{T} \log \frac{e^{z_i}}{Z}$$

which here corresponds to a single-sample cross entropy between the student logits and the teacher’s logits, assuming we’ve applied the usual softmax (with temperature one) to turn these into probability distributions. The teacher’s probability distribution could be a one-hot vector if we consider the “usual” classification problem, but the argument made in many knowledge distillation papers is that if we consider targets that are not one-hot, the student obtains richer information and achieves lower test error.

The derivative of the cross entropy with respect to a single output $z_i$ is often assigned as an exercise in neural network courses, and is good practice:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \frac{e^{z_i}}{Z} - \frac{e^{t_i}}{T}$$

or $q_i - p_i$ in the paper’s notation. (As a side note, I don’t understand why the paper uses $\mathcal{L}_i$ with a subscript $i$ when the loss is the same for all components.) We have $i \in \{1, 2, \ldots, n\}$, and following the paper’s notation, let $*$ represent the true label. Without loss of generality, we can assume that $n$ is always the correct label (just re-shuffle the labels as necessary). Now consider the more complete case of a minibatch with $b$ elements, accounting for all the logits. We have:

$$\mathcal{L} = -\frac{1}{b}\sum_{s=1}^{b} \sum_{i=1}^{n} \frac{e^{t_{i,s}}}{T_s} \log \frac{e^{z_{i,s}}}{Z_s}$$

where $Z_s$ and $T_s$ are the normalizers for sample $s$,

and so the derivative we use is:

$$\frac{1}{b}\sum_{s=1}^{b} \sum_{i=1}^{n} (q_{i,s} - p_{i,s})$$

Just to be clear, we sum up across the minibatch and scale by $1/b$, which is often done in practice so that gradient updates are independent of minibatch size. We also sum across the logits, which might seem odd, but remember that the $z_{i,s}$ terms are not neural network parameters (in which case we wouldn’t be summing them up) but the outputs of the network. In backpropagation, computing the gradients with respect to the weights requires computing derivatives with respect to network nodes, of which the $z$s (usually) form the final layer, and the sum here arises from an application of the chain rule.
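Since the $q_i - p_i$ form is easy to get wrong by a sign or a normalizer, here is a quick finite-difference sanity check of the single-sample gradient (my own throwaway script, not code from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(z, t):
    # H(p, q) with teacher distribution p = softmax(t) and student q = softmax(z).
    p, q = softmax(t), softmax(z)
    return -np.sum(p * np.log(q))

rng = np.random.default_rng(0)
z = rng.normal(size=5)   # student logits
t = rng.normal(size=5)   # teacher logits

# Analytic gradient: dL/dz_i = q_i - p_i.
analytic = softmax(z) - softmax(t)

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, t) - cross_entropy(z - dz, t)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```

Note that the components of the analytic gradient sum to zero, since both $q$ and $p$ sum to one; shifting all logits by a constant leaves the loss unchanged.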

Indeed, as the paper claims, if we have the ground-truth label $y_{*,s} = 1$, then the first term (the component from the true label $i = *$) is:

$$\frac{1}{b}\sum_{s=1}^{b} \left( q_{*,s} - p_{*,s}\, y_{*,s} \right)$$

and thus the output of the teacher, $p_{*,s}$, is a weighting factor on the original ground-truth label. If we were using the normal one-hot target, then the above would be the gradient with $p_{*,s}=1$, and the gradient approaches that case as the teacher grows more confident. Again, all of this seems reasonable.

The paper also argues that this is related to importance weighting of the samples:

$$\sum_{s=1}^{b} \frac{p_{*,s}}{\sum_{u=1}^{b} p_{*,u}} \left( q_{*,s} - y_{*,s} \right)$$

So the question is, does knowledge distillation (called “dark knowledge”) from (Hinton et al., 2014) work because it is performing a version of importance weighting? And by “a version of” I assume the paper refers to this because it seems like the $q_{*,s}$ is included in importance weighting, but not in their interpretation of the gradient.

Of course, it could also work due to the information here, in the second term:

$$\frac{1}{b}\sum_{s=1}^{b} \sum_{i=1}^{n-1} (q_{i,s} - p_{i,s})$$

which is in the “wrong” labels. This is the claim made by (Hinton et al., 2014), though it was not backed up by much evidence there. It would be interesting to see the relative contribution of these two gradient terms in these more sophisticated experiments with ResNets and DenseNets. How do we do that? The authors apply two evaluation metrics:

• Confidence Weighted by Teacher Max (CWTM): One which “formally” applies importance weighting with the argmax of the teacher.
• Dark Knowledge with Permuted Predictions (DKPP): One which permutes the non-argmax labels.

These techniques apply the argmax of the teacher, not the ground-truth label as discussed earlier. Otherwise, we might as well not be doing machine teaching.

It appears that if CWTM performs very well, one can conclude that most of the gains come from the importance weighting scheme. If not, then it is the information in the non-argmax labels that is critical. A similar argument applies to DKPP: if it performs well, the gains can’t be due to the information in the non-argmax labels. I was hoping to see a setup which could remove the importance weighting scheme entirely, but I think that’s too embedded in the original training objective to disentangle.
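For what it’s worth, here is how I would sketch the two controls in numpy. This is my reading of the metrics from the paper’s description, with made-up toy numbers, so treat it as a sketch rather than the authors’ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cwtm_weights(teacher_probs):
    """Confidence Weighted by Teacher Max: each sample's importance
    weight is the teacher's max probability, normalized over the batch."""
    conf = teacher_probs.max(axis=1)
    return conf / conf.sum()

def dkpp_targets(teacher_probs):
    """Dark Knowledge with Permuted Predictions: keep each sample's
    argmax entry in place and permute the non-argmax probabilities."""
    out = teacher_probs.copy()
    n = out.shape[1]
    for s in range(out.shape[0]):
        rest = np.delete(np.arange(n), out[s].argmax())
        out[s, rest] = out[s, rng.permutation(rest)]
    return out

# Toy teacher distributions over n=4 classes for a batch of b=3 samples.
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.40, 0.10],
              [0.05, 0.05, 0.05, 0.85]])
w = cwtm_weights(p)    # batch importance weights, summing to 1
pp = dkpp_targets(p)   # rows still sum to 1; argmax positions unchanged
```

The point of DKPP is that each permuted row is still a valid distribution with the same argmax, so any performance it retains cannot come from which non-argmax label got which probability.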

The experiments systematically test a variety of setups (identical teacher and student architectures, ResNet teacher to DenseNet student, applying CWTM and DKPP, etc.). They claim improvements across different setups, validating their hypothesis.

Since I don’t have experience programming or using ResNets or DenseNets, it’s hard for me to fully internalize these results. Incidentally, all the values reported in the various tables appear to have been run with one random seed … which is extremely disconcerting to me. I think it would be advantageous to pick fewer of these experimental setups and run each with 50 seeds to gauge the level of significance. It would also make the results seem less like a laundry list.

It’s also disappointing to see the vast majority of the work here on CIFAR-100, which isn’t ImageNet-caliber. There’s a brief report on language modeling, but there needs to be far more.

Most of my criticisms are a matter of doing more training runs, which hopefully should be less problematic given more time and better computing power (the authors are affiliated with Amazon, after all…), so hopefully we will have stronger generalization claims in future work.

Update 05/29/2018: After reading the Policy Distillation paper, it looks like that paper already showed that matching a tempered softmax (of Q-values) from the teacher using the same architecture resulted in better performance in a deep reinforcement learning task. Given that reinforcement learning on Atari is arguably a harder problem than supervised learning of CIFAR-100 images, I’m honestly surprised that the Born Again Neural Networks paper got away without mentioning the Policy Distillation comparison in more detail, even when considering that the Q-values do not form a probability distribution.

# International Conference on Robotics and Automation (ICRA) 2018, Day 5 of 5

ICRA, like many academic conferences, schedules workshops and/or tutorials on the beginning and ending days. The 2018 edition was no exception: for the fifth and final day, it offered about 10 workshops on a variety of topics. Succinctly, these are venues where a smaller group of researchers can discuss a common research sub-theme. Typically, workshops invite guest speakers and have their own poster sessions for works-in-progress or for shorter papers. These venues are less prestigious than full conference papers, which is why I don’t submit to workshops.

I attended most of the cognitive robotics workshop, since it included multi-robot and human-robot collaboration topics.

In the morning session, at least two of the guest speakers hinted at some skepticism of Deep Learning. One, for instance, had this slide:

An amusing slide at the day's workshop, featuring our very own Michael I. Jordan.

which features Berkeley professor Michael I. Jordan’s (infamous) IEEE interview from four years ago. I would later get to meet the speaker when he walked over to me to inquire about the sign language interpreting services (yay, networking!!). I obviously did not have much to offer him in terms of technical advice, so I recommended that he read Michael I. Jordan’s recent Medium blog post about how the AI revolution “hasn’t happened yet.”

The workshops were located near each other, so there were lots of people during the food breaks.

I stayed for the full morning, and then for a few more talks in the afternoon. Eventually, I decided that the topics being presented — while interesting in their own right — were less relevant to my immediate research agenda than I had originally thought, so I left at about 2:00pm, my academic day done. For the rest of the afternoon, I stayed at the convention center and finally finished reading Enlightenment Now: The Case for Reason, Science, Humanism, and Progress.

While I was reading the book, I indulged my bad habit of checking my phone and social media. I had access to a Berkeley Facebook group chat, and it turns out that many of the students went traveling today to other areas in Brisbane.

Huh, I wonder if frequent academic conference attendees often skip the final “workshop day”? Just to be clear, I don’t mean these workshops are pointless or useless, but maybe the set of workshops is too heavily specialized or just not as interesting? I noticed a similar trend with UAI 2017, in that the final workshop day had relatively low attendance.

Now that the conference is over, my thoughts generally lean positive. Sure, there are nitpicks here and there: ICRA isn’t double-blind (which seems contrary to best science practices) and is pricey, as I mentioned in an earlier blog post. But as a consequence, ICRA is well-funded and exudes a sophisticated feel. The Brisbane venue was fantastic, as was the food and drink.

As always, I don’t think I networked enough, but I noticed that most Berkeley students ended up sticking with people they already knew, so maybe students don’t network as much as I thought?

I also have praise for my sign language interpreters, who tried hard. They also taught me about Auslan and the differences in sign language between Australia and the United States.

Well, that’s a wrap for ICRA. It is time for me to fly back to Vancouver and then to San Francisco … life will return to normal.

# International Conference on Robotics and Automation (ICRA) 2018, Day 4 of 5

For the fourth day of ICRA, I again went running (for the fourth consecutive morning). This time, rather than run across the bridge to get to the South Bank, I ran on a long pathway that extended below some roads:

Below the roads, there is a long paved path.

Normally, I would feel hesitant to run underneath roads, since (at least in America) those places tend to be messy and populated by those with nowhere else to live. But the pathway here was surprisingly clean, and even at 6:30am, there were a considerable number of walkers, runners, and bikers.

After my usual preparation, I went over to the conference for the 9:00am plenary talk, provided by Queensland Professor Mandyam Srinivasan.

Professor Mandyam Srinivasan gave the third plenary talk for ICRA 2018.

As usual, it was hard to follow the technical details of the talk. The good news is that the talk was high-level, probably (almost) as high-level as Professor Brooks’ talk, and certainly less technical than Raia Hadsell’s talk. I remember there being lots of videos in this plenary, which presents a logistical “eye-challenge,” since I have to strike a delicate balance between watching the video and watching the sign language interpreter.

Due to the biological nature of the talk, I also remembered Professor Robert Full’s thrilling keynote at the Bay Area Robotics Symposium last November. I wonder if those two have ever collaborated?

I stayed for the keynote talk after that, about soft robotics, and then we had the morning poster session. As usual, there was plenty of food and drink, and I had to resist the urge to keep making trips to the food tables. The food items followed the by-now familiar pattern of one “sweet” and one “savory” item:

The food selection for today's morning poster session.

Later, we had the sixth and final poster session of the conference. The most interesting thing for me was … me, since that was when I presented my poster:

I, standing by my poster.

I stood there for 2.5 hours and talked with a number of conference attendees. Thankfully, none of the conversations were hostile or overly combative. People by and large seemed happy with what I was doing and saying. Also, my sign language interpreters finally had something to do during the poster sessions, since for the other five I had mostly been walking around without talking to people.

After the poster session, we had the farewell reception, which (as you can expect) was filled with lots of food and drinks. It took place in the plaza level of the convention center, which included an outside area along with several adjacent indoor rooms.

The food items included the usual bread, rice, and veggie dishes. For meat, we had salmon, sausages, and steak:

Some delicious steaks being cooked.

The steak was delicious!

Interestingly enough, the “dessert” turned out to be fruit, breaking the trend from past meals.

The farewell reception was crowded and dark, but the food was great.

The reception was crowded with long lines for food, particularly for the steak (obviously!). The other food stations providing the salmon and sausages were frequently out of stock. These are, however, natural problems since most of us were grabbing as much meat as we could during our first trips to the food tables. Maybe we need an honor code about the amount of meat we consume?

As an aside, I think for future receptions, ICRA should provide backpack and poster tube storage. We had that yesterday for the conference dinner and it was very helpful since cocktail-style dining means both hands are often holding something — one for alcoholic beverages and the other for food. Since I had just finished presenting my poster/paper, I was awkwardly lugging around a poster tube. My sign language interpreter kindly offered to hold it for the time I was there.

Again, ICRA does not skimp on the food and beverages. Recall that we had a welcome reception (day one), a conference dinner (day three) and the farewell reception (day four, today), so it’s only the second and fifth evenings that the conference doesn’t officially sponsor a dinner.

# International Conference on Robotics and Automation (ICRA) 2018, Day 3 of 5

For the third day, I did my usual morning run and then went to the conference. Today’s plenary was from DeepMind’s Raia Hadsell.

Her talk was about navigation, and in the picture above you see her mention London taxi drivers. I was aware of them after reading Peak, and it was a pleasure to see the example appear here. On the other hand, I’m not sure how much we will need taxi drivers with Uber and automated cars, so … maybe it’s not a good idea to try to be a taxi driver nowadays, even in London?

Anyway, a lot of the talk was focused on the navigation parts, and the emphasis was on real-world driving because, as Hadsell admitted, many of the examples that DeepMind uses in their papers (particularly, the Labyrinth) are simulated.

After her talk, we had Pieter Abbeel’s keynote on Learning to Learn. It was similar to his other talks, with discussion on the RL^2 algorithm and his other meta-learning papers.

The standard poster session followed Abbeel’s keynote. The food for the morning was a “vegetable bun” (some starch with veggies inside) and cupcakes.

I didn’t mention this in my last post, but ICRA also contains some late-breaking results, which you can see in the poster session here:

ICRA contains late-breaking results, which you can see here behind the pre-arranged dietary catering.

This is necessary because ICRA has such a long turnaround time compared to other computer science conferences. The deadline for papers is September 15, with decisions by January 15, and then the conference runs May 21-25. By contrast, ICML, NIPS, and other conferences have far faster turnaround times, so it’s good for ICRA to allocate a little space for outstanding yet recent results.

We had lunch, an afternoon keynote, another poster session, and so forth. I found a Berkeley student who I had been wanting to talk to for a while, so that was good. Other than that, I spent most of my time walking around, taking pictures of interesting posters (but not really talking to anyone) and then I sat down in a resting area and did some blogging.

Soon, the social highlight of the day would occur: the conference-sponsored, cocktail-style dinner. (Well, actually, there were two dinners, this one and a “Global Entrepreneurs” dinner held at the same time slot, but for the latter you had to pay extra, and I’m guessing most conference attendees opted for the dinner I went to.)

The conference dinner was at the Queensland Gallery of Modern Art, built just a few blocks away from the conference venue. I didn’t know what a “cocktail-style dinner” meant, so I was initially worried that the event was going to be a sit-down dinner with lots of tables, and that I would be sitting by myself or with some random strangers with foreign accents.

My concerns were alleviated by seeing all the open space by the entrance:

The start of the dinner, before we were allowed inside.

I like cocktail-style dinners since it means I can move around quickly in case I get bored or trapped in a conversation with someone who I can’t understand — it’s best to cut ties and walk away politely if it’s clear that communication will not work out.

Here was what the crowd looked like at dinner. These are pictures of the outside areas, but there was also considerable space inside the building.

The lines were very long, and that’s probably my only criticism for the night. There were a few staff members who walked out with food for us to pick, and I think we needed more of those. The long lines notwithstanding, the food and drinks were really good, so I don’t want to take too much away from the event — it was far more impressive than what I had imagined for a conference dinner.

I would also be remiss if I didn’t mention that there was some impressive art in the museum. (No food and drink allowed in these areas!)

There was some nice art!

I stayed for an hour and a half and then headed back to my hotel room.

# International Conference on Robotics and Automation (ICRA) 2018, Day 2 of 5

The second day started out much like the first one. I went for a run, this time in the opposite direction from yesterday to explore a different part of the bay. I ran for a few miles, then went back to the hotel to get ready for the conference. This time, I skipped the hotel’s breakfast, since I wanted to save money and I figured the conference would give us plenty of food today, as it did yesterday (I was right).

After some customary opening remarks, we had the first plenary talk. Just to be clear, ICRA has two main talk styles:

• Plenary talks are each one hour, held from the 9:00am to 10:00am slots on Tuesday, Wednesday, and Thursday. These take place in the largest room, the Great Hall, and are designed to be less about the technical details and more about engaging everyone in the conference.

• Keynote talks are each a half-hour, held from 10:00am to 10:30am and then later from 2:00pm to 2:30pm on Tuesday, Wednesday, and Thursday. At these time slots, three keynotes are provided at any given time, so you have to pick which one you want to be in. These are still designed to be less technical than most academic talks, but more technical and specialized than the plenary talks.

Professor and entrepreneur Rodney Brooks was scheduled to give the first plenary talk, and in many ways he is an ideal speaker. Brooks is widely respected in the fields of robotics and AI, with papers galore. He’s also an entrepreneur, having founded iRobot and Rethink Robotics. I’ll refer you to his website so that you can feel inadequate.

The other reason why Brooks must have been invited as a speaker is that he’s a native Australian (Peter Corke emphasized: “with an Australian passport!”), and 2018 is the first year that ICRA has been held in Australia. (Though, having said that, I wonder why they didn’t invite my advisor, John Canny, who like Brooks is also Australian — in fact, Canny is also from the Adelaide area. Of course, with only three plenaries, it would look too much like “home cooking” with two Australians speaking …)

Brooks gave an engaging, high-level talk on some of the challenges that we face today in robotics applied to the real world. He talked about demographics, climate change (see next image), and issues with self-driving cars deployed to the real world. There was also an entertaining rant about the state of modern Human-Robot Interaction research, since we’re using Amazon Mechanical Turk (ugh) and still relying on dumb-ass $p$-values (ugh ugh).

Professor Rodney Brooks providing information regarding some climate change effects.

After Brooks’ talk, we had an excellent keynote on machine learning and robotics by Toronto professor Angela Schoellig. I didn’t take too many pictures for this one since it seemed to be a different flavor of machine learning and robotics than I’m used to, but I might investigate some of her papers later.

Then we had the “morning tea” (or more accurately: tea and coffee and pastries and fruits) plus the morning poster session held simultaneously, in the same room as the welcome reception from last night. This was the first of six major poster sessions (morning and afternoon for each of Tuesday, Wednesday, and Thursday). It consisted of several “pods” that you can see here:

The pods that were set up in the poster session.

The arrangement is interesting, and different from the usual straight aisles that we see in poster sessions. The pods make it easier to see which papers are related to each other, since all one has to do is circle the corresponding pod. They might also be more space-efficient.

The food at the poster session was plentiful:

The catering services in the morning tea session.

It again wasn’t the standard kind of food you’d expect at your standard American hotel, reflecting both the diversity of food in Australia and also the amount of money that ICRA has due to sponsors and high registration fees. Oh, and the silly surcharge for extra pages in papers.

Ironically, the coffee and hot water stations appeared to be frequently empty or close to empty, prompting one of the sign language interpreters to argue that this was the reverse of last night, in which we had far too easy access to lots of drinks, but not much access to food.

Closer to noon, while the poster session was still going on — though I think most attendees had started to tire of walking around the pods — the employees brought out some lunch-style food.

The food scene after the convention employees brought lunch.

All of this was happening in half of the gigantic exhibition area. The other half had industry sponsor booths, like in the welcome reception from last night. You can see a few of them in the background of the above photo.

I took a break to relax and read a research paper since my legs were tired, but by 2:00pm, another one of the day’s highlights occurred: Professor Ken Goldberg’s keynote talk on grasping and experiment reproducibility:

Professor Ken Goldberg's call for reproducibility in his excellent afternoon keynote.

It was a terrific talk, and I hope that people will also start putting failure cases in papers. I will remember this, though I think I am only brave enough to put failure cases in appendices, and not the main page-constrained portion of my papers.

After Ken’s talk, there was another poster session with even more food, but frankly, I was full and a bit weary of constantly walking around, so I mostly just said hi to a few people I knew and then went back to the hotel. There was a brief Nvidia-hosted reception that I attended to get some free food and wine, but I did not stay long since I did not know anyone else.

Before going to sleep, I handled some administration with my schedule. I tried to reassure the person who was handling the sign language accommodation that, for the poster sessions, it is not necessary to have two or more interpreters since I spend the time mostly wandering around and taking pictures of interesting papers. It’s impossible to deeply understand a paper just by reading a poster and talking with the author, so I take pictures and then follow-up by reading as needed.

I also asked her to cancel the remaining Deaf/Hearing Interpreter (DI/HI) appointments, since those were simply too difficult for me to understand and benefit from (see my previous post for what these entail). To be fair, the other “normal” interpreting services were not that beneficial either, but I could at least understand perhaps 10-30% of the content in “broken” fashion. But for DI/HI, I simply don’t think it’s helpful to have spoken English translated to Australian sign language, and then translated to American/English sign language. I felt bad about canceling, since it wasn’t the fault of the Deaf/Hearing Interpreter team, but at the same time I wanted to be honest about how I felt about the interpreting services.

# International Conference on Robotics and Automation (ICRA) 2018, Day 1 of 5

Due to my clever sleep schedule, I was able to wake up at 5:00am and feel refreshed. As I complained in my previous blog post, the hotel I was at lacks a fitness center, forcing me to go outside and run since I cannot for the life of me live a few days without doing something to improve my physical fitness.

I killed some time by reviewing details of the conference, then ran outside once the sun was rising. I ran on the bridge that crosses the river and was able to reach the conference location. It is close to a park, which has a splendid display of “BRISBANE” as shown here:

A nice view of Brisbane's sunrise.

After running, I prepared for the conference. I walked over and saw this where we were supposed to register:

The line for registration on Monday morning.

This picture doesn’t do justice to ICRA’s popularity. There were a lot of attendees here.

But first, I had to meet my sign language interpreters! A few comments:

• I decided to use sign language interpreting services rather than captioning. I have no idea if this will be better but it’s honestly hard to think about how it can be worse than the captioning from UAI 2017.
• Although Australia and America are both English-speaking countries, the sign language used there (“Auslan”) is not the same as the one used in America.
• Thankfully, Berkeley’s DSP found an international interpreting agency which could try and find interpreters familiar with American signing. They hired one who specialized in ASL and who has lived in both Australia and the US, making him an ideal choice. The other interpreters specialized in different sign languages or International Sign.
• There was a “Deaf Interpreter/Hearing Interpreter” team. Essentially, this means having a deaf person who knows multiple sign languages (e.g., ASL and Auslan). That deaf interpreter is the one who signs for me, but he/she actually looks at a hearing interpreter who can interpret in one of the sign languages that both of them know (e.g., Auslan) but which I don’t. Thus, the translation would be: spoken English, to Auslan, to American signing. The reason for this is obvious, since in Australia, most interpreters know Auslan, but not ASL. I wouldn’t see this team until the second day of the conference, but the experience turned out to be highly unwieldy and wasn’t beneficial, so I asked to discontinue the service.
• All of the interpreting services were obtained after a four month process of Berkeley’s DSP searching for an international interpreting agency and then booking them for this conference. Despite the long notice, some of the schedule was still in flux and incomplete at the conference start date, so it goes to show that even four months might not be enough to get things booked perfectly. To be fair, it’s more like two or three months, since conference schedules aren’t normally released until a month or so after paper decisions come out. That’s one of my complaints about academic conferences, but I’ll save my ranting for a future blog post.

I met them and after customary introductions, it was soon 9:00am, when the first set of tutorials and workshops began. It seemed to be structured much like UAI 2017, in that there are workshops and tutorials on the first and last days, while the “main conference” lies in between. For us, this meant Monday and Friday were for the workshops/tutorials.

There were several full-day and half-day sessions offered on a variety of topics. I chose to attend the “Deep Learning for Robotics Perception” tutorial, because it had “Deep Learning” in the title.

The morning tutorial on deep learning for robotics perception.

For this conference, I decided not to take detailed notes of every talk. I did that for UAI 2017, and it turned out to be of no use whatsoever as I never once looked at my Google Doc notes after the conference ended. Instead, my strategy now is to take pictures of any interesting slides, and then scan my photos after the conference to see if there’s anything worthwhile to follow-up.

The Deep Learning tutorial was largely on computer vision techniques that we might use for robotics. Much of the first half was basic knowledge to me. In the second half, it was a pleasure to see them mention the AUTOLAB’s work on Grasp-Quality Convolutional Neural Networks.

The interpreters had a massively challenging task with this tutorial. The one who knew ASL well was fine, but another one — who mentioned ASL was her fourth-best sign language — had to quit after just a few seconds (she apologized profusely) and be replaced. The third, who was also somewhat rusty with ASL, lasted his full time set, though admittedly his signing was awkward.

Fortunately, the one who had to quit early was able to recover and for her next 20-minute set, she was able to complete it, albeit with some anxiety coupled with unusual signs that I could tell were international or Auslan. Even with the American interpreter, it was still tremendously challenging for me to even follow the talk sentence-by-sentence, so I felt frustrated.

After lunch, we had the afternoon tutorials in a similar format. I attended the tutorial on visual servoing, featuring four 45-minute talks. The third was from UC Berkeley Professor Pieter Abbeel, who before beginning the talk found me in the crowd[^1] and congratulated me for making the BAIR blog a success.

You can imagine what his actual talk must have looked like: a packed room of amazed attendees trying to absorb as much of Pieter’s rapid-fire presentation as possible. I felt sorry for the person who had to present after Pieter, since about 80% of the people in the room left once Pieter finished his talk.

The sign language situation in the afternoon tutorials wasn’t much better than that of the morning tutorials, unfortunately. The presentation I understood the most in the afternoon was, surprise surprise, Pieter’s, but that’s because I had already read almost all of the corresponding research papers.

Later in the evening, we all gathered in the Great Hall, the largest room in the exhibition, for some opening remarks from the conference organizer. Before that, we had one of the more interesting conference events: a performance by indigenous Australians. To make a long story short, in Brisbane it’s common (according to my sign language interpreter) to begin large events by having indigenous Australians perform a demonstration. This is a sign of respect for how these people inhabited Australia for many thousands of years.

For the show, several shirtless men with paint on them played music and danced. They rubbed wood with some other device and created some smoke and fire. Perhaps they got this cleared through the building’s security? I hope so. I took a photo of their performance, which you see below. Unfortunately it’s not the one with the smoke and fire.

Native Australians giving us a show.

I don’t know if anyone else felt this way, but does it feel awkward seeing “natives” (whether Australian or American) wearing almost nothing while “privileged Asians and Whites” like me sit in the audience wearing business attire with our mandatory iPhones and Macbook Pro laptops in hand? Please don’t get me wrong: I fully respect and applaud the conference organizers and the city of Brisbane as a whole for encouraging this type of respect; I just wonder if there are perhaps better ways to do this. It’s an open question, I think, and no, ignoring history in America-style fashion is not a desirable alternative.

After the natives gave their show, to which they received rousing applause, the conference chair (Professor Peter Corke) provided some welcoming remarks and then conference statistics such as the ones shown in the following photo:

Some statistics from conference chair Peter Corke.

There are lots of papers at ICRA! Here are a few relevant statistics from this and other slides (not shown in this blog post):

• The acceptance rate was 40%, resulting in 1052 papers accepted. That’s … a lot! Remember, at least one author of each paper is supposed to attend the conference, but in reality several often attend, along with industry sponsors and so forth, so the number of attendees is surely much higher than 1052 even when accounting for how several researchers can first-author multiple ICRA papers.
• Papers with authors from Venezuela had a 100% paper acceptance rate. I guess it’s good to find the positives in Venezuela, given the country’s recent free-fall, which won’t be mitigated by their sham election.
• The 2018 edition of ICRA broke the record for number of paper submissions, with 2586. The previous high was from 2016, which had around 2350 submissions.
• The United States submitted the most papers of any country, besting the next group: China, Germany, and France. When you scale by a country’s population, Singapore comes first (obviously!), followed by Switzerland, Australia (woo hoo, home team!!), Denmark, and Sweden. It’s unclear how these statistics counted papers whose authors were based in different countries.

After this, we all went over to the welcome reception.

The first thing I noticed: wow, this is going to be noisy and crowded. At least I would have a sign language interpreter who would tag along with me, which despite being awkward from a social perspective is probably the best I can hope for.

The second thing I noticed: wow, there are lots of booths that provide wine and beer. Here’s one of many:

One of many drinking booths in the welcome reception on Monday night.

To satisfy our need for food, several convention employees walked around with finger food on large plates. I learned from one of the sign language interpreters tagging along with me that one of the offerings was a kangaroo dish. Apparently, kangaroo is a popular meat item in Australia.

It is also quite tasty.

There were a large number of booths for ICRA sponsors, various robotics competitions, or other demonstrations. For instance, here’s one of the many robotics demonstrations, this time for “field robotics,” I suppose:

One of many robotics booths set up in the welcome reception.

And it wouldn’t be a (well-funded) Australian conference if we didn’t get to pet some animals. There were snakes and wombats (see below image) for us to touch:

We could pet a wombat in the welcome reception (plus snakes and other animals).

I’ll tell you this: ICRA does not skimp on putting on a show. There was a lot to process, and unusually for me, I fell asleep extremely quickly once I got back to my hotel room.

[^1]: Not that it was a challenging task, since I was sitting in the front row and he’s seen the sign language interpreters at Berkeley many times.

# Prelude to ICRA 2018, Day 0 of 5

On Friday, May 18, I bade farewell to my apartment and my work office to go to San Francisco International Airport (SFO). Why? I was en route to ICRA 2018, the premier conference on robotics and — I believe — its largest in terms of number of papers, conference attendees, and the sheer content offered in the form of various tutorials, workshops, and sponsor events.

For travel, I booked an Air Canada round trip from San Francisco to Vancouver to Brisbane. Yes, I had to go north and then south … there unfortunately weren’t any direct flights from San Francisco to Brisbane during my time frame (San Francisco to Sydney is a more popular route). But I didn’t mind, as I could finally stop using United Airlines.

As usual, I got to the airport early, and then hiked over to the International Terminal. At SFO, I’m most familiar with Terminal 3 (United) and the International Terminal (for international travel) and for the latter, my favorite place to eat is Napa Farms Market, which embodies the essence of San Francisco cuisine. I had some excellent pork which was cut in-house, and cauliflower rice (yeah, see what I said about SF?).

The SFO Napa Farms Market.

Incidentally, for Terminal 3 dining, I highly recommend Yankee Pier and their fish dishes.

My original plan was to pass security at around 2:00pm (which I did), then get a nice lunch and relax at the gate before my scheduled 4:20pm departure time. Unfortunately, while I was eating my Napa Farms Market dish, the waterfall of delays would begin. After three separate delays, I soon learned that my flight to Vancouver wouldn’t depart until after 7:10pm. At least, assuming there weren’t any more delays after that.

Ouch, apparently United isn’t the only airline that struggles to keep things on time. Maybe it’s a San Francisco issue; is there too much traffic? Or could it be the airport’s awkward location? It’s oddly situated, bordering the inside of the bay rather than the open Pacific Ocean.

The good news was twofold, though:

• I had a spare United Club pass that I could use to enter a club. Fortunately, it turns out that even if you are not flying United, you can still get access to the clubs with a same-day boarding pass and a club pass. I don’t know if it helped that Air Canada is part of Star Alliance, but I’m guessing this would be OK even if I was flying Air Whatchamacallit.
• I had arranged for a five hour layover at Vancouver. My flight to Brisbane was scheduled to leave past 11:00pm.

This is why I aim for layovers of 5+ hours. It gives me an enormous buffer zone in case of any delays (and at SFO, I expect them) and I really don’t want to be late for academic conferences. Furthermore, the delays let me relax at airports for a longer time (assuming you have access to lounges!) which means I can string together a longer time block reading papers, reading books, and blogging, all while “fine dining” from a graduate student’s perspective.

I went to the United Club at Terminal 3 since that’s the largest one, and the lounge I’m most familiar with due to all my domestic United travels. I found a nice place to work (more accurately, read a research paper) and enjoyed some of the free food. I had a cappuccino, some cheese, crackers, fruit, and then enjoyed the standard complimentary house wine.

The good news was that the flight wasn’t delayed much longer after that, though I’m not sure how well a 3+ hour delay reflects on Air Canada. Hopefully it’s a rare event. I boarded the flight and was soon in Vancouver.

The flight itself was non-eventful and I don’t remember what I did. I entered Vancouver, and saw some impressive artwork when I arrived.

Vancouver artwork.

I couldn’t waste too much time, though. I had to pass through immigration. Remember, United States =/= Canada. Just be warned: if you are arriving at a country from a different country, even if it is en route to yet another country, you still have to go through immigration and customs and then the whole security pipeline after that. Fortunately, despite the long line in the photo here, I got through quickly. Note also the sign language interpreter video.

Immigration at Vancouver.

I quickly hurried over to my gate and managed to get some decent snacks at one of the few places open at 10:30pm. The good news is that I had informed my credit card company that I was traveling in Canada and Australia, so I could use my credit card. Pro tip for anyone traveling! You don’t want your card declined at the most awkward of moments.

And then … I boarded the fourteen-and-a-half-hour flight to Brisbane. Ouch! A few things about it:

• There was free wine, so apparently the Air Canada policy is the same as United for long-haul international flights. I wonder, though, why they offered the wine an hour after we departed (which was past midnight in Vancouver and SFO time) when it seemed like most passengers wanted to sleep. I thought they would have held off providing wine until perhaps the 10-hour mark of the flight, but I guess not. I drank some red wine and it was OK to me, though another conference attendee (who I met once I arrived in Brisbane) told me she hated it.
• The flight offered three meals (!!), whereas I thought only one (or at most two) would be provided. That’s nice of them. I had requested a “fruit plate” dish upon paying for the tickets a few months ago, because I do not trust meat on airline food. To my surprise, the airline respected the request and gave me two fruit plates. I would have had a third, except I declined the last one since I decided to try out the egg dish for the third one, but really, thank you Air Canada for honoring my food request! I’ll remember this for the future.
• The guy who was in my set of three seats (sitting by the window) had an Nvidia backpack, so I asked him about that, and it turned out he’s another conference attendee. I also saw a few Berkeley students who I recognized on the plane. I’m also pretty sure that other conference attendees could tell I was going because I was awkwardly dragging a bulky poster tube.
• I sat in the aisle seat. This is what you want for a 14.5-hour flight, because it makes it so much easier to leave the seat and walk around, which I had to do several times. Always get aisle seats for long flights!! Amazingly, as far as I could tell, the Nvidia guy sitting at the window never left his seat for the entire flight. I don’t think my body could handle that.
• The flight offered the usual fare of electronic entertainment (movies, card games, etc.) but I mostly slept (itself an accomplishment for me!) and read Enlightenment Now — more on the book later!

I arrived in Brisbane, then went through immigration again. But before that, I passed through the following duty-free store which was rather judiciously placed so that passengers had to pass it before getting to immigration and customs:

The duty-free area we passed through upon arriving to Brisbane.

There is a LOT of alcohol here! I enjoy some drinks now and then, but I wonder if society worldwide is putting way too much emphasis on booze.

Going through immigration was no problem at all. Getting my taxi driver to go to my hotel was a different matter. We had a testy conversation when I asked him to drive me to my hotel. I kept asking him to repeat and repeat his question since I couldn’t understand what he was saying, but then I explicitly showed him the address which I had printed on a paper beforehand, and then he was fine. Yeah, lesson learned — just give them a written address and they’ll be fine. Fortunately, I made it to the hotel at 9:00am, but check-in wasn’t until 2:00pm (as expected), so I left my luggage there and decided to explore the surrounding Brisbane area.

ICRA this year is located at the Brisbane Convention Centre, which is in the South Bank Parklands. I could tell that there’s a LOT to do. One of John Canny’s students told me a few days ago that “there isn’t much to do in Brisbane,” so I had to send a friendly rebuke to him via text message.

I wandered around before finding a place to eat breakfast (in Australian time) which was more accurately lunch for me. I had some spinach, tomatoes, poached eggs, zucchini bread, and chai tea:

My lunch in the South Bank, Brisbane.

I’m not much of a poached egg person but it was great! And adding milk to the standard hot water and tea seems like an interesting combination that I’ll try out more frequently in the future.

I took a few more pictures of the South Bank. Here’s one which shows an artificial beach within the park on the left, with the real river on the right.

The artificial beach (left) and South Bank river (right).

Here’s a better view of the “beach”:

A view of the "beach."

There’s tons of great stuff here: playgrounds, more restaurants, more extensive beaches, some architectural and artistic pieces, and so forth. I can’t take pictures of everything, unfortunately, so please check out the South Bank website which should make you want to start booking some flights.

Just make sure you schedule your mandatory five-hour layover if SFO is part of your itinerary.

I wandered around a bit and found a nice place to relax:

One of many places to relax in South Bank.

I sat on a bench and read, Enlightenment Now: The Case for Reason, Science, Humanism, and Progress. This is Bill Gates’ favorite book, and is close to being one of my favorites as well. I’ll blog about it later, but the main theme is that, despite what we may think of from the media and what politicians say, life has continued to be much better for humanity as a whole, and there is never any better time to live in the world than today. I think my trip to Brisbane epitomizes this:

• I can board a 14.5-hour flight from Vancouver to Brisbane and trust that the airline’s safety record will result in a safe landing.
• I can study robotics and AI, instead of engaging in backbreaking agricultural labor.
• I have the freedom to walk around safely and read books as I please, rather than constantly worry about warfare or repercussions for reading non-government sanctioned books.
• I have the ability to easily book a hotel well in advance, knowing that I will have a roof over my head.
• I can blog and share the news about this to friends and family around the world.

Pinker doesn’t ignore the obvious hardships that many people face nowadays, but he makes a strong case that we are not focusing enough on the positive trends (e.g., decline in worldwide extreme poverty) and, more importantly, what we can learn from them so that they continue rather than slide back.

I devoured Enlightenment Now for about an hour or two and took a break — it’s a 500-page book filled with dense endnotes — and toured more of the South Bank. Overall, the place is extremely impressive and great for tourists of all shapes and sizes. Here are some (undoubtedly imprecise and biased) tradeoffs between this and the Darling Harbor area that I went to for UAI 2017:

• Darling Harbor advantages: more high-end restaurants, better cruises, feels cleaner and wealthier
• South Bank advantages: a larger variety of events (many family-friendly), perhaps cheaper food, better running routes

The bottom line is that both areas are great, and if I had to pick one to visit, it would probably be whichever one I haven’t been to in the longest time.

I then went back to my hotel at around 3pm, desperate to relax and shower. The hotel I was in, Ibis Brisbane, is one of the cheaper ones here, and it shows in what it provides. The WiFi is sketchy, the electrical outlets are located in awkward configurations, there is no moisturizing cream, only two large towels are offered, and there is no fitness center (really?!?!?!?!?!?!?!?!?).

It’s not as good as the hotel I stayed at in Sydney, but at least it’s a functioning hotel and I can stay here for six nights without issues.

I showered and went out to explore to get some sources of nutrition. I found a burger place and was going to eat quickly and head back to the hotel … when, as luck would have it, six other Berkeley students decided to come to the place as soon as I was about to leave. They generously allowed me to join their group. It’s a good thing I was wearing a “Berkeley Computer Science” jacket!

I followed them to their hotel, which was head and shoulders above mine. Their place was a full studio with a balcony, and the view was great:

The view of the South Bank at night from high up.

I stayed for a while, then walked back to my hotel. I slept early, reaping the benefits of judiciously drinking coffee and timing meals so that I can wake up early and feel refreshed for tomorrow.

# Practicing ROS Programming

I am currently engaging in a self-directed, badly-needed crash course on ROS Programming. ROS (Robot Operating System) is commonly used for robotics programming and research, and is robot-agnostic so knowledge of ROS should generalize to different robot types. Yet even after publishing a robotics paper, I still didn’t feel like I understood how my ROS code was working under the hood since other students had done much of the lower-level stuff earlier. This blog post thus summarizes what I did to try and absorb ROS as fast as possible.

To start learning about ROS, it’s a good idea (indeed, perhaps mandatory) to take a look at the excellent ROS Wiki. ROS is summarized as:

ROS (Robot Operating System) provides libraries and tools to help software developers create robot applications. It provides hardware abstraction, device drivers, libraries, visualizers, message-passing, package management, and more. ROS is licensed under an open source, BSD license.

The ROS wiki is impressively rich and detailed. If you scroll down and click “Tutorials”, you will see (as of this writing) twenty for beginners, and eight for more advanced users. In addition, the Wiki offers a cornucopia of articles related to ROS libraries, guidelines, and so on.

It’s impossible to read all of this at once, so don’t! Stick with the beginner tutorials for now, and try and remember as much as you can. I recorded my notes in my GitHub repository for my “self-studying” here. (Incidentally, that repository is something I’m planning to greatly expand this summer with robotics and machine learning concepts.)

As always, it is faster to learn by doing in addition to reading, so it is critical to run the code in the ROS tutorials. Unfortunately, the code they use involves manipulating a “Turtle Sim” robot. This is perhaps my biggest disappointment with the tutorials: the turtle is artificial and hard to relate to real robots. Of course, this is somewhat unavoidable if the Wiki (and ROS as a whole) wants to avoid showing favoritism to certain robots, so perhaps it’s not a fair criticism, but I thought I’d bring it up anyway.
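For reference, the beginner tutorials largely boil down to running commands like these, each in its own terminal (assuming a ROS 1 installation is set up and sourced):

```shell
# Start the ROS master, the simulated turtle, and keyboard teleop,
# each in a separate terminal.
roscore
rosrun turtlesim turtlesim_node
rosrun turtlesim turtle_teleop_key

# Then poke at the plumbing that the tutorials explain:
rostopic list                    # e.g., /turtle1/cmd_vel, /turtle1/pose
rostopic echo /turtle1/pose      # stream the turtle's position
rosnode list                     # see which nodes are running
```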

To alleviate the disconnect between a turtle and what I view as a “real robot,” it is critical to start running code on a real robot. But since real robots cost many thousands of dollars and exhibit all the vagaries you would expect from complex physical systems (wear and tear, battery drainage, breakdowns, etc.), I highly recommend starting with a simulator.

In the AUTOLAB, I have access to a Fetch and a Toyota HSR, both of which provide a built-in simulator using Gazebo. The simulator creates a testing environment where I can move and adjust the robot in a variety of ways without having to deal with a physical robot. The advantage of investing time in the simulator is that code tested there should translate directly to the real, physical robot without any changes, apart from adjusting the ROS_MASTER_URI environment variable.
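As a sketch, the switch between simulator and real robot might look like this (the robot hostname below is made up; check your robot’s manual for the real one):

```shell
# Simulated robot: Gazebo and the ROS master both run on my machine.
export ROS_MASTER_URI=http://localhost:11311

# Physical robot: point at the master running on the robot instead.
# ("fetch1059.local" is a hypothetical hostname.)
export ROS_MASTER_URI=http://fetch1059.local:11311
export ROS_IP=$(hostname -I | awk '{print $1}')  # let the robot reach this machine
```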

Details on the simulator should be provided in the manuals that you get for the robots. Once the simulator is installed (usually via sudo apt-get install) and working, the next step is to figure out how to code. One way to do this is to borrow someone’s existing code base and tweak it as desired.

For the Fetch, my favorite code base is the one used in the University of Washington’s robotics course. It is a highly readable, modular code base which provides a full-blown Python Fetch API with much of the stuff I need: arm movement, base movement, head movement, etc. On top of that, there’s a whole set of GitHub wiki pages which provides high-level descriptions of how ROS and other things work. When I was reading these — which was after I had done a bit of ROS programming — I was smiling and nodding frequently, as the tutorials had confirmed some of what I had assumed was happening.

The primary author of the code base and Wiki is Justin Huang, a PhD student with Professor Maya Cakmak. Justin, you are awesome!

I ended up taking bits and pieces from Justin’s code, and added a script for dealing with camera images. My GitHub code repository here contains the resulting code, and this is the main thing I used to learn ROS programming. I documented my progress in various README files in that repository, so if you’re just getting started with ROS, you might find it helpful.

Playing around with the Gazebo simulator, I was able to move the Fetch torso to its highest position and then assign the joint angles so that its gripper actually coincides with the base. Oops, heh, I suppose that’s a flaw with the simulator?

The amusing result when you command the Fetch's arm to point directly downwards. The gripper "cuts" through the base, which can't happen on the physical robot.

Weird physics notwithstanding, the Gazebo simulator has been a lifesaver for me in understanding ROS, since I can now see the outcome on a simulated version of the real robot. I hope to continue making progress in learning ROS this summer, and to use other tools (such as rviz and MoveIt) that could help accelerate my understanding.

I’m currently en route to the International Conference on Robotics and Automation (ICRA) 2018, which should provide me with another environment for massive learning on anything robotics. If you’re going to ICRA and would like to chat, please drop me a line.

# Interpretable and Pedagogical Examples

In my last post, I discussed a paper on algorithmic teaching. I mentioned in the last paragraph that there was a related paper, Interpretable and Pedagogical Examples, that I’d be interested in reading in detail. I was able to do that sooner than expected, so naturally, I decided to blog about it. A few months ago, OpenAI had a blog post discussing the contribution and ramifications of the paper, so I’m hoping to focus more on stuff they didn’t cover to act as a complement.

This paper is currently “only” on arXiv as it was rejected from ICLR 2018 — not due to lack of merit, it seems, but because the authors had their names on the manuscript, violating the double-blind nature of ICLR. I find it quite novel, though, and hope it finds a home somewhere in a conference.

There are several contributions of this over prior work in machine teaching and the like. First, they use deep recurrent neural networks for both the student and the teacher. Second and more importantly, they show that with iterative — not joint — training, the teacher will teach using an interpretable strategy that matches human intuition, and which furthermore is efficient in conveying concepts with the fewest possible samples (hence, “pedagogical”). This paper focuses on teaching by example, but there are other ways to teach, such as using pairwise comparisons as in this other OpenAI paper.

How does this work? We consider a two-agent environment with a student $\mathbf{S}$ and a teacher $\mathbf{T}$, both of which are parameterized by deep recurrent neural networks $\theta_{\mathbf{S}}$ and $\theta_{\mathbf{T}}$, respectively. The setting also involves a set of concepts $\mathcal{C}$ (e.g., different animals) and examples $\mathcal{E}$ (e.g., images of those animals).

The student needs to map a series of $K$ examples to concepts. At each time step $t$, it guesses the concept $\hat{c}$ that the teacher is trying to convey. The teacher, at each time step, takes in $\hat{c}$ along with the concept it is trying to convey, and must output an example that (ideally) will make $\hat{c}$ “closer” to $c$. Examples may be continuous or discrete.

As usual, to train $\mathbf{S}$ and $\mathbf{T}$, it is necessary to devise an appropriate loss function $\mathcal{L}$. In this paper, the authors chose to have $\mathcal{L}$ be a function from $\mathcal{C}\times \mathcal{C} \to \mathbb{R}$ where the input is the true concept and the student’s concept after the $K$ examples. This is applied to both the student and teacher; they use the same loss function and are updated via gradient descent. Intuitively, this makes sense: both the student and teacher want the student to know the teacher’s concept. The loss is usually the $L_2$ (continuous) or the cross-entropy (discrete).

A collection of important aspects from the paper "Interpretable and Pedagogical Examples." Top left: a visualization of the training process. Bottom left: joint training baseline which should train the student but not create interpretable teaching strategies. Right: iterative training procedure which should create interpretable teaching strategies.

The figure above includes a visualization of the training process. It also includes both the joint and iterative training procedures. The student’s function is written as $\mathbf{S}(e_k | \theta_{\mathbf{S}})$, and this is what is used to produce the next concept. The authors don’t explicitly pass in the previous examples or the student’s previously predicted concepts (the latter of which would make this an “autoregressive” model) because, presumably, the recurrence means the hidden layers implicitly encode the essence of this prior information. A similar thing is seen with how one writes the teacher’s function: $\mathbf{T}(c_i, \hat{c}_{i,k-1} | \theta_{\mathbf{T}})$.

The authors argue that joint training means the teacher and student will “collude” and produce un-interpretable teaching, while iterative training lets them obtain interpretable teaching strategies. Why? They claim:

The intuition behind separating the optimization into two steps is that if $\mathbf{S}$ learns an interpretable learning strategy in Step 1, then $\mathbf{T}$ will be forced to learn an interpretable teaching strategy in Step 2. The reason we expect $\mathbf{S}$ to learn an “interpretable” strategy in Step 1 is that it allows $\mathbf{S}$ to learn a strategy that exploits the natural mapping between concepts and examples.

I think the above reason boils down to the fact that the teacher “knows” the true concepts $c_1,\ldots,c_n$ in the minibatch of concepts above, and those are fixed throughout the student’s training portion. Of course, this would certainly be easier to understand after implementing it in code!
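Speaking of code, here is a toy sketch of the iterative procedure under drastic simplifying assumptions of my own (scalar concepts and examples, single-weight linear “networks” instead of the paper’s RNNs), just to make the alternating two-step structure concrete:

```python
import random

# Toy sketch (my own simplification, NOT the paper's setup): concepts c
# and examples e are scalars, the "student" and "teacher" are single
# weights theta_S and theta_T, and both are trained on the shared loss
# L(c, c_hat) = (c - c_hat)^2, where the teacher emits e = theta_T * c
# and the student guesses c_hat = theta_S * e.

def train_iteratively(steps=300, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta_S, theta_T = 0.5, 0.5
    for _ in range(steps):
        c = rng.uniform(0.5, 1.0)               # concept to convey
        # Step 1: update the student, holding the teacher fixed.
        e = theta_T * c
        err = theta_S * e - c
        theta_S -= lr * 2 * err * e             # dL/d(theta_S) = 2 * err * e
        # Step 2: update the teacher, holding the (new) student fixed.
        e = theta_T * c
        err = theta_S * e - c
        theta_T -= lr * 2 * err * theta_S * c   # dL/d(theta_T) = 2 * err * theta_S * c
    return theta_S, theta_T

theta_S, theta_T = train_iteratively()
# The composed teacher-student map converges to the identity,
# i.e., theta_S * theta_T approaches 1, so the student recovers c.
```

Joint training would instead update both weights simultaneously on the same samples; the paper’s argument is that the alternation, not the architecture, is what pushes the teacher toward interpretable strategies.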

The experimental results are impressive and cover a wide range of scenarios:

• Rule-Based: this is the “rectangle game” from cognitive science, where teachers provide points within a rectangle, and the student must guess the boundary. The intuitive teaching strategy would be to provide two points at opposite corners.

• Probabilistic: the teacher must teach a bimodal mixture of Gaussians distribution, and the intuitive strategy is to provide points at the two modes (I assume, based on the relative weights of the two Gaussians).

• Boolean: how does the teacher teach an object property, when objects may have multiple properties? The intuitive strategy is to provide two points where, of all the properties in the provided/original dataset, the only one that the two have in common is what the teacher is teaching.

• Hierarchical: how does a teacher teach a hierarchy of concepts? The teacher learns the intuitive strategy of picking two examples whose lowest common ancestor is the concept node. Here, the authors use images from a “subtree” of ImageNet and use a pre-trained ResNet to reduce each image to a vector in $\mathbb{R}^{2048}$.

For the first three above, the loss is $\mathcal{L}(c,\hat{c}) = \|c-\hat{c}\|_2^2$, whereas the fourth problem setting uses the cross entropy.
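In code, these two losses are straightforward (a minimal sketch; the function names are mine):

```python
import math

# Minimal sketches of the two losses: squared L2 distance for
# continuous concepts, and cross-entropy over a softmax of scores
# for discrete concepts.

def l2_loss(c, c_hat):
    """||c - c_hat||_2^2, for concepts given as lists of floats."""
    return sum((a - b) ** 2 for a, b in zip(c, c_hat))

def cross_entropy_loss(true_index, scores):
    """-log softmax(scores)[true_index], computed stably."""
    m = max(scores)  # subtract the max to avoid overflow in exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[true_index]
```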

There is also evaluation that involves human subjects, which is the second definition of “interpretability” the authors invoke: how effective is $\mathbf{T}$’s strategy at teaching humans? They do this using the probabilistic and rule-based experiments.

Overall, this paper is enjoyable to read, and most criticism I have concerns things likely beyond the scope of any one paper. One possible exception: understanding the neural network architecture and training. The architecture, for instance, is not specified anywhere. Furthermore, some of the training seemed excessively hand-tuned; for example, the authors tend to train using $X$ examples for $K$ iterations, and I wonder if these needed to be tuned.

I think I would like to try implementing this algorithm (using PyTorch to boot!), since it’s been a while since I’ve seriously tried replicating a prior result.

# Algorithmic and Human Teaching of Sequential Decision Tasks

I spent much of the last few months preparing for the UC Berkeley EECS PhD qualifying exams, as you might have been able to tell by the style of my recent blogging (mostly paper notes) and my lack of blogging for the last few weeks. The good news is that I passed the qualifying exam. Like I did for my prelims, I wrote a “transcript” of the event. I will make it public at a future date. In this post, I discuss an interesting paper that I skimmed for my quals but didn’t have time to read in detail until after the fact: Algorithmic and Human Teaching of Sequential Decision Tasks, a 2012 AAAI paper by Maya Cakmak and Manuel Lopes.

This paper is interesting because it offers a different perspective on how to do imitation learning. Normally, in imitation learning, there is a fixed set of expert demonstrations $D_{\rm expert} = \{\tau_1, \ldots, \tau_K \}$ where each demonstration $\tau_i = (s_0,a_0,s_1\ldots,a_{N-1},s_N)$ is a sequence of states and actions. Then, a learner has to run some algorithm (classically, either behavior cloning or inverse reinforcement learning) to train a policy $\pi$ that, when executed in the same environment, is as good as the expert.

In many cases, however, it makes sense for the teacher to select the most informative demonstrations for the student to learn a task. This paper thus falls under the realm of Active Teaching (AT). This is not to be confused with Active Learning, as the authors clarify here:

A closely related area for the work presented in this paper is Active Learning (AL) (Angluin 1988; Settles 2010). The goal of AL, like in AT, is to reduce the number of demonstrations needed to train an agent. AL gives the learner control of what examples it is going to learn from, thereby steering the teacher’s input towards useful examples. In many cases, a teacher that chooses examples optimally will teach a concept significantly faster than an active learner choosing its own examples (Goldman and Kearns 1995).

This paper sets up the student to internally run inverse reinforcement learning (IRL), and follows prior work in assuming that the value function can be written as:

$$V^\pi(s) \overset{(i)}{=} \mathbb{E}_s\left[ \sum_{t=0}^\infty \gamma^t R(s_t) \right] \overset{(ii)}{=} \mathbb{E}_s\left[ \sum_{t=0}^\infty \gamma^t \sum_{j=1}^k w_j \phi_j(s_t) \right] \overset{(iii)}{=} \sum_{j=1}^k w_j\, \mathbb{E}_s\left[ \sum_{t=0}^\infty \gamma^t \phi_j(s_t) \right] \overset{(iv)}{=} \bar{w}^\top \bar{\mu}(s)$$

where in (i) I applied the definition of a value function when following policy $\pi$ (for notational simplicity, when I write a state under the expectation, like $\mathbb{E}_s$, that means the expectation assumes we start at state $s$), in (ii) I substituted the reward function by assuming it is a linear combination of $k$ features, in (iii) I re-arranged, and finally in (iv) I simplified in vector form using new notation.

We can augment the $\bar{\mu}$ notation to also indicate the initial action that was chosen, as in $\bar{\mu}(s_a)$. Then, using the fact that the IRL agent assumes that: “if the teacher chooses action $a$ in state $s$, then $a$ must be at least as good as all the other available actions in $s$”, we have the following set of constraints from the demonstration data $D$ consisting of all trajectories:

$$\bar{w}^\top \bar{\mu}(s_a) \ge \bar{w}^\top \bar{\mu}(s_{a'}) \quad \forall (s,a) \in D,\; \forall a' \ne a$$

The paper’s main technical contribution is as follows. They argue that the above set of (half-space) constraints results in a subspace $c(D)$ that contains the true weight vector, which is equivalent to obtaining the true reward function assuming we know the features. The weights are assumed to be bounded within some hypercube, e.g., $\bar{w} \in [-1,1]^k$. By sampling $N$ different weight vectors $\bar{w}_i$ within that hypercube, they can check the percentage of sampled weights that lie within that true subspace with this (indirect) metric:

$$G(D) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left\{ \bar{w}_i \in c(D) \right\}$$

Mathematically their problem is to find the set of demonstrations $D$ that maximizes $G(D)$, because if that value is larger, then the sampled weights are more likely to satisfy all the constraints, meaning that it has the property of representing the true reward function.

Note carefully: we’re allowed to change $D$, the demonstration set, but we can’t change the way the weights are sampled: they have to be sampled from a fixed hypercube.

Their algorithm is simple: do a greedy approximation. First, select a starting state. Then, select the demonstration $\tau_j$ that increases the current value $G(D \cup \{\tau_j\})$ the most. Repeat until $G(D)$ is high enough.
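To make the procedure concrete, here is a minimal numpy sketch of the algorithm as described above: sample weight vectors from the hypercube, measure the fraction satisfying a candidate demonstration set’s half-space constraints, and greedily add demonstrations. All names, dimensions, and the constraint-matrix representation of a demonstration are my own placeholder assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURE_DIM = 3   # k features, as in the grid-world domains
N_SAMPLES = 1000  # number of weight vectors sampled from the hypercube

def constraint_rows(demo):
    """Assume each demo is pre-processed into a constraint matrix whose rows
    are (mu(s_a) - mu(s_a')) for every alternative action a'; the true
    weights w satisfy w @ row >= 0 for every row."""
    return demo

def G(demos, w_samples):
    """Fraction of sampled weights satisfying every half-space constraint."""
    if not demos:
        return 0.0
    A = np.vstack([constraint_rows(d) for d in demos])  # (num_constraints, k)
    ok = np.all(w_samples @ A.T >= 0, axis=1)
    return float(ok.mean())

def greedy_teach(candidate_demos, threshold=0.9):
    """Greedily add the demonstration that most increases G(D)."""
    w_samples = rng.uniform(-1, 1, size=(N_SAMPLES, FEATURE_DIM))
    D, remaining = [], list(candidate_demos)
    while remaining and G(D, w_samples) < threshold:
        best_idx = max(range(len(remaining)),
                       key=lambda i: G(D + [remaining[i]], w_samples))
        D.append(remaining.pop(best_idx))
    return D
```

The representation of a demonstration as a stacked constraint matrix is purely for illustration; in the paper the constraints come from comparing feature expectations of the taken action against all alternatives.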

For experiments, the paper relies on two sets of Grid-World mazes, shown below:

The two grid-worlds used in the paper.

Each of these domains has three features, and furthermore, only one is “active” at a given square in the map, so the vectors are all one-hot. Both domains have two tasks (hence, there are four tasks total), each of which is specified by a particular value of the feature weights. This is the same as specifying a reward function, so the optimal path for an agent may vary.
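As a concrete illustration of this task specification, here is a tiny sketch: each cell has a one-hot feature vector, and a weight vector over the three features defines the reward, hence the task. The feature names and weight values are made up for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical cell types, each with a one-hot feature vector (k = 3).
features = {
    "grass": np.array([1.0, 0.0, 0.0]),
    "water": np.array([0.0, 1.0, 0.0]),
    "goal":  np.array([0.0, 0.0, 1.0]),
}

def reward(w, cell):
    """Linear reward: R(s) = w . phi(s)."""
    return float(w @ features[cell])

# Two different weight vectors define two different tasks on the same maze,
# so the optimal path differs between them:
task_A = np.array([-0.1, -1.0, 1.0])  # strongly avoids water
task_B = np.array([-0.1,  0.0, 1.0])  # indifferent to water
```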

The paper argues that their algorithm results in the most informative demonstrations in the teacher’s set. For the first maze, only one demonstration is necessary to convey each of the two tasks offered; for the second, only two are needed for the two tasks.

From observing example outcomes of the optimal teaching algorithm we get a better intuition about what constitutes an informative demonstration for the learner. A good teacher must show the range of important decision points that are relevant for the task. The most informative trajectories are the ones where the demonstrator makes rational choices among different alternatives, as opposed to those where all possible choices would result in the same behavior.

That the paper’s experiments involve only these hand-designed mazes is probably one of its main weaknesses. There’s no way this could extend to high dimensions, where sampling from a hypercube (even if it’s “targeted” in some way, and not sampled naively) would essentially never result in a weight vector that satisfies all the IRL constraints.

To conclude, this AAAI paper, though short and limited in some ways, provided me with a new way of thinking about imitation learning with an active teacher.

Out of curiosity about follow-up, I looked at the Google Scholar of papers that have cited this. Some interesting ones include:

• Cooperative Inverse Reinforcement Learning, NIPS 2016
• Showing versus Doing: Teaching by Demonstration, NIPS 2016
• Enabling Robots to Communicate Their Objectives, RSS 2017

I’m surprised, though, that one of my recent favorite papers, Interpretable and Pedagogical Examples, didn’t cite this one. That one is somewhat similar to this work except it uses more sophisticated Deep Neural Networks within an iterative training procedure, and has far more impressive experimental results. I hope to talk about that paper in a future blog post and to re-implement it in code.

# One-Shot Visual Imitation Learning via Meta-Learning

A follow-up paper to the one I discussed in my previous post is One-Shot Visual Imitation Learning via Meta-Learning. The idea is, again, to train neural network parameters $\theta$ on a distribution of tasks such that the parameters are easy to fine-tune to new tasks sampled from the distribution. In this paper, the focus is on imitation learning from raw pixels and showing the effectiveness of a one-shot imitator on a physical PR2 robot.

Recall that the original MAML paper showed the algorithm applied to supervised regression (for sinusoids), supervised classification (for images), and reinforcement learning (for MuJoCo). This paper shows how to use MAML for imitation learning, and the extension is straightforward. First, each imitation task $\mathcal{T}_i \sim p(\mathcal{T})$ contains the following information:

• A trajectory $\tau = \{o_1,a_1,\ldots,o_T,a_T\} \sim \pi_i^*$ consists of a sequence of observations and actions from an expert policy $\pi_i^*$. Remember, this is imitation learning, so we can assume an expert. Also, note that the expert policy is task-specific.

• A loss function $\mathcal{L}(a_{1:T},\hat{a}_{1:T}) \to \mathbb{R}$ providing feedback on how closely our actions match those of the expert.

Since the focus of the paper is on “one-shot” learning, we assume we only have one trajectory available for the “inner” gradient update portion of meta-training for each task $\mathcal{T}_i$. However, if you recall from MAML, we actually need at least one more trajectory for the “outer” gradient portion of meta-training, as we need to compute a “validation error” for each sampled task. This is not the overall meta-test time evaluation, which relies on an entirely new task sampled from the distribution (and which only needs one trajectory, not two or more). Yes, the terminology can be confusing. When I refer to “test time evaluation” I always refer to when we have trained $\theta$ and we are doing few-shot (or one-shot) learning on a new task that was not seen during training.

All the tasks in this paper use continuous control, so the loss function for optimizing our neural network policy $f_\theta$ can be described as:

$$\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)} \sim \pi_i^*} \sum_{t=1}^{T} \left\| f_\theta(o_t^{(j)}) - a_t^{(j)} \right\|_2^2$$

where the first sum normally has one trajectory only, hence the “one-shot learning” terminology, but we can easily extend it to several sampled trajectories if our task distribution is very challenging. The overall objective is now:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\left( f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)} \right)$$

and one can simply run Adam to update $\theta$.
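As a sanity check of the structure, here is a toy numpy sketch of this imitation loss and the one-shot inner update. The linear “policy” and the dimensions are stand-ins of my own choosing, not the paper’s vision network:

```python
import numpy as np

rng = np.random.default_rng(1)
OBS_DIM, ACT_DIM, T = 4, 2, 10  # toy dimensions, not from the paper

def policy(theta, obs):
    """Toy linear policy f_theta: observation -> continuous action."""
    return obs @ theta

def imitation_loss(theta, demo_obs, demo_acts):
    """Squared error between the policy's actions and the expert's,
    summed over one demonstration's time steps."""
    pred = policy(theta, demo_obs)              # (T, ACT_DIM)
    return float(np.sum((pred - demo_acts) ** 2))

def inner_update(theta, demo_obs, demo_acts, alpha=0.01):
    """One gradient step on a single demonstration ("one-shot" adaptation).
    The gradient of the squared error is computed analytically for the
    linear policy: 2 X^T (X theta - A)."""
    grad = 2 * demo_obs.T @ (policy(theta, demo_obs) - demo_acts)
    return theta - alpha * grad
```

In MAML terms, `inner_update` produces $\theta_i'$ from one demonstration, and the outer objective would evaluate `imitation_loss` at $\theta_i'$ on a second, held-out demonstration of the same task.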

This paper uses two new techniques for better performance: a two-headed architecture, and a bias transformation.

• Two-Headed Architecture. Let $y_t^{(j)}$ be the vector of post-activation values just before the last fully connected layer which maps to motor torques. The last layer has parameters $W$ and $b$, so the inner loss function $\mathcal{L}_{\mathcal{T}_i}(f_\theta)$ can be re-written as:

$$\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)} \sim \pi_i^*} \sum_{t} \left\| W y_t^{(j)} + b - a_t^{(j)} \right\|_2^2$$

where, I suppose, we should write $\phi = (\theta, W, b)$ and re-define $\theta$ to be all the parameters used to compute $y_t^{(j)}$.

In this paper, the test-time single demonstration of the new task is normally provided as a sequence of observations (images) and actions. However, they also experiment with the more challenging case of removing the provided actions for that single test-time demonstration. They simply remove the action and use this inner loss function:

$$\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)}} \sum_{t} \left\| W y_t^{(j)} + b \right\|_2^2$$

This is still a bit confusing to me. I’m not sure why this loss function leads to the desired outcome. It’s also a bit unclear how the two-headed architecture training works. After another read, maybe only the $W$ and $b$ are updated in the inner portion?

The two-headed architecture seems to be beneficial on the simulated pushing task, with performance improving by about 5-6 percentage points. That may not sound like a lot, but this was in simulation and they were able to test with 444 total trials.

The other confusing part is that if we assume we’re allowed to have access to expert actions, then the real-world experiment actually used the single-headed architecture, and not the two-headed one. So there wasn’t a benefit to the two-headed one assuming we have actions. Without actions, of course, the two-headed one is our only option.

• Bias Transformation. After a certain neural network layer (which in this paper is after the 2D spatial softmax applied following the convolutions that process the images), they concatenate a vector of learned parameters, the “bias transformation.” They claim that

[…] the bias transformation increases the representational power of the gradient, without affecting the representation power of the network itself. In our experiments, we found this simple addition to the network made gradient-based meta-learning significantly more stable and effective.

However, the paper doesn’t seem to show too much benefit to using the bias transformation. A comparison is reported in the simulated reaching task, with a dimension of 10, but it could be argued that performance is similar without the bias transformation. For the two other experimental domains, I don’t think they reported with and without the bias transformation.

Furthermore, neural networks already have biases. So is there some particular advantage to having more biases packed in one layer, and furthermore, with that layer being the same spot where the robot configuration is concatenated with the processed image (like what people do with self-supervision)? I wish I understood. The math that they use to justify the gradient representation claim makes sense; I’m just missing a tiny step to figure out its practical significance.

They ran their setups on three experimental domains: simulated reaching, simulated pushing, and (drum roll please) real robotic tasks. For these domains, they seem to have tested up to 5.5K demonstrations for reaching and 8.5K for pushing. For the real robot, they used 1.3K demonstrations (ouch, I wonder how long that took!). The results certainly seem impressive, and I agree that this paper is a step towards generalist robots.

# Model-Agnostic Meta-Learning

One of the recent landmark papers in the area of meta-learning is MAML: Model-Agnostic Meta-Learning. The idea is simple yet surprisingly effective: train neural network parameters $\theta$ on a distribution of tasks so that, when faced with a new task, they can be rapidly adjusted through just a few gradient steps. In this post, I’ll briefly go over the notation and problem formulation for MAML, and meta-learning more generally.

Here’s the notation and setup, mostly following the paper:

• The overall model $f_\theta$ is what MAML is optimizing, with parameters $\theta$. We denote $\theta_i'$ as weights that have been adapted to the $i$-th task through one or more gradient steps. Since MAML can be applied to classification, regression, reinforcement learning, and imitation learning (plus even more stuff!) we generically refer to $f_\theta$ as mapping from inputs $x_t$ to outputs $a_t$.

• A task $\mathcal{T}_i$ is defined as a tuple $(T_i, q_i, \mathcal{L}_{\mathcal{T}_i})$, where:

• $T_i$ is the time horizon. For (IID) supervised learning problems like classification, $T_i=1$. For reinforcement learning and imitation learning, it’s whatever the environment dictates.

• $q_i$ is the transition distribution, defining a prior over initial observations $q_i(x_1)$ and the transitions $q_i(x_{t+1}\mid x_{t},a_t)$. Again, we can generally ignore this for simple supervised learning. Also, for imitation learning, this reduces to the distribution over expert trajectories.

• $\mathcal{L}_{\mathcal{T}_i}$ is a loss function that maps the sequence of network inputs $x_{1:T}$ and outputs $a_{1:T}$ to a scalar value indicating the quality of the model. For supervised learning tasks, this is almost always the cross entropy or squared error loss.

• Tasks are drawn from some distribution $p(\mathcal{T})$. For example, we can have a distribution over the abstract concept of doing well at “block stacking tasks”. One task could be about stacking blue blocks. Another could be about stacking red blocks. Yet another could be stacking blocks that are numbered and need to be ordered consecutively. Clearly, the performance of meta-learning (or any alternative algorithm, for that matter) on optimizing $f_\theta$ depends on $p(\mathcal{T})$. The more diverse the distribution’s tasks, the harder it is for $f_\theta$ to quickly learn new tasks.

The MAML algorithm specifically finds a set of weights $\theta$ that are easily fine-tuned to new, held-out tasks (for testing) by optimizing the following:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$$

This assumes that $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$. It is also possible to do multiple gradient steps, not just one. Thus, if we do $K$-shot learning, then $\theta_i'$ is obtained via $K$ gradient updates based on the task. However, “one shot” is cooler than “few shot” and also easier to write, so we’ll stick with that.

Let’s look at the loss function above. We are optimizing over a sum of loss functions across several tasks, but we are evaluating the (outer-most) loss functions while assuming we made gradient updates to our weights $\theta$. What if the loss function were like this instead:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_\theta)$$

This means $f_\theta$ would be capable of learning how to perform well across all these tasks. But there’s no guarantee that this will work on held-out tasks, and generally speaking, unless the tasks are very closely related, it shouldn’t work. (I’ve tried doing some similar stuff in the past with the Atari 2600 benchmark where a “task” was “doing well on game X”, and got networks to optimize across several games, but generalization was not possible without fine-tuning.) Also, even if we were allowed to fine-tune, it’s very unlikely that one or a few gradient steps would lead to solid performance. MAML should do better precisely because it optimizes $\theta$ so that it can adapt to new tasks with just a few gradient steps.

MAML is an effective algorithm for meta-learning, and one of its advantages over other algorithms such as ${\rm RL}^2$ is that it is parameter-efficient. The gradient updates above do not introduce extra parameters. Furthermore, the actual optimization over the full model $\theta$ is also done via SGD:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$$

again introducing no new parameters. (The update is actually Adam if we’re doing supervised learning, and TRPO if doing RL, but SGD is the foundation of those and it’s easier for me to write the math. Also, even though the updates may be complex, I believe the inner part, where we have $f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}$, is always vanilla SGD, but I could be wrong.)

I’d like to emphasize a key point: the above update mandates two instances of $\mathcal{L}_{\mathcal{T}_i}$. One of these, the one in the subscript used to get $\theta_i'$, should involve the $K$ training instances from the task $\mathcal{T}_i$ (or more specifically, from $q_i$). The outer-most loss function should be computed on testing instances, also from task $\mathcal{T}_i$. This is important because we want our ultimate evaluation to be done on testing instances.

Another important point is that we do not use those “testing instances” for evaluating meta-learning algorithms, as that would be cheating. For testing, one takes a held-out set of test tasks entirely, adjusts $\theta$ for however many steps are allowed (one in the case of one-shot learning, etc.) and then evaluates according to whatever metric is appropriate for the task distribution.

In a subsequent post, I will further investigate several MAML extensions.

# Zero-Shot Visual Imitation

In this post, I will further investigate one of the papers I discussed in an earlier blog post: Zero-Shot Visual Imitation (Pathak et al., 2018).

For notation, I denote states and actions at some time step $t$ as $s_t$ and $a_t$, respectively, if they were obtained through the agent exploring in the environment. A hat symbol, $\hat{s}_t$ or $\hat{a}_t$, refers to a prediction made from some machine learning model.

Basic forward (left) and inverse (right) model designs.

Recall the basic forward and inverse model structure (figure above). A forward model takes in a state-action pair $(s_t, a_t)$ and predicts the subsequent state $\hat{s}_{t+1}$. An inverse model takes in a current state $s_t$ and some goal state $s_g$, and must predict the action that will enable the agent to go from $s_t$ to $s_g$.

• It’s easiest to view the goal input to the inverse model as either the very next state $s_{t+1}$, or the final desired goal of the trajectory, but some papers also use $s_g$ as an arbitrary checkpoint (Agrawal et al., 2016, Nair et al., 2017, Pathak et al., 2018). For the simplest model, it probably makes most sense to have $s_g = s_{t+1}$ but I will use $s_g$ to maintain generality. It’s true that $s_g$ may be “far” from $s_t$, but the inverse model can predict a sequence of actions if needed.

• If the states are images, these models tend to use convolutions to get a lower dimensional featurized state representation. For instance, inverse models often process the two input images through tied (i.e., shared) convolutional weights to obtain $\phi(s_t)$ and $\phi(s_{t+1})$, upon which they’re concatenated and then processed through some fully connected layers.

As I discussed earlier, there are a number of issues related to this basic forward/inverse model design, most notably about (a) the high dimensionality of the states, and (b) the multi-modality of the action space. To be clear on (b), there may be many (or no) action(s) that let the agent go from $s_t$ to $s_g$, and the number of possibilities increases with a longer time horizon, if $s_g$ is many states in the future.

Let’s understand how the model proposed in Zero-Shot Visual Imitation mitigates (b). Their inverse model takes in $s_g$ as an arbitrary checkpoint/goal state and must output a sequence of actions that allows the agent to arrive at $s_g$. To simplify the discussion, let’s suppose we’re only interested in predicting one step in the future, so $s_g = s_{t+1}$. Their predictive physics design is shown below.

The basic one-step model, assuming that our inverse model just needs to predict one action. The convolutional layers for the inverse model use the same tied network convolutional weights. The action loss is the cross-entropy loss (assuming discrete actions), and is not written in detail due to cumbersome notation.

The main novelty here is that the predicted action $\hat{a}_t$ from the inverse model is provided as input to the forward model, along with the current state $s_t$. We then try to recover $s_{t+1}$, the actual state that was encountered during the agent’s exploration. This loss $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ is the standard Euclidean distance and is added to the action prediction loss $\mathcal{L}(a_t,\hat{a}_t)$, which is the usual cross-entropy (for discrete actions).
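Here is a minimal numpy sketch of that joint objective, with tiny linear stand-ins for the convolutional inverse and forward models. The architecture, the dimensions, and the hard argmax used to pass the predicted action to the forward model are my own simplifications for illustration, not the paper’s design.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 8, 4  # toy sizes, not from the paper

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def inverse_model(W_inv, s_t, s_g):
    """Predict a distribution over actions from (current state, goal state)."""
    return softmax(W_inv @ np.concatenate([s_t, s_g]))

def forward_model(W_fwd, s_t, a_onehot):
    """Predict the next state from (current state, action)."""
    return W_fwd @ np.concatenate([s_t, a_onehot])

def consistency_loss(W_inv, W_fwd, s_t, s_g, a_t, s_next):
    """Cross-entropy on the action plus L2 forward-consistency on the
    next state, with the *predicted* action fed to the forward model."""
    probs = inverse_model(W_inv, s_t, s_g)
    action_loss = -np.log(probs[a_t] + 1e-12)        # cross-entropy vs. taken action
    a_hat = np.eye(N_ACTIONS)[probs.argmax()]        # predicted action (hardened)
    s_hat = forward_model(W_fwd, s_t, a_hat)
    state_loss = np.sum((s_next - s_hat) ** 2)       # forward consistency
    return float(action_loss + state_loss)
```

The hardened argmax above is not differentiable; it is only meant to show the data flow of feeding $\hat{a}_t$ into the forward model.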

Why is this extra loss function from the successor states used? It’s because we mostly don’t care which action we took, so long as it leads to the desired next state. Thus, we really want $\hat{s}_{t+1} \approx s_{t+1}$.

• There’s some subtlety with making this work. The state loss $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ treats $s_{t+1}$ as ground truth, but that assumes we took action $a_t$ from state $s_t$. If we instead took $\hat{a}_t$ from $s_t$, and $\hat{a}_t \ne a_t$, then it seems like the ground-truth should no longer be $s_{t+1}$?

Assuming we’ve trained long enough, then I understand why this will work, because the inverse model will predict $\hat{a}_t = a_t$ most of the time, and hence the forward model loss makes sense. But one has to get to that point first. In short, the forward model training must assume that the given action will actually result in a transition from $s_t$ to $s_{t+1}$.

The authors appear to mitigate this with pre-training the inverse and forward models separately. Given ground truth data $\mathcal{D} = \{s_1,a_1,s_2,\ldots,s_N\}$, we can pre-train the forward model with this collected data (no action predictions) so that it is effective at understanding the effect of actions.

This would also enable better training of the inverse model, which (as the authors point out) depends on an accurate forward model to be able to check that the predicted action $\hat{a}_t$ has the desired effect in state-space. The inverse model itself can also be pre-trained entirely on the ground-truth data while ignoring $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ from the training objective.

I think this is what the authors did, though I wish there were a few more details.

• A surprising aspect of the forward model is that it appears to predict the raw states $s_{t+1}$, which could be very high-dimensional. I’m surprised that this works, given that (Agrawal et al., 2016) explicitly avoided this by predicting lower-dimensional features. Perhaps it works, but I wish the network architecture was clear. My guess is that the forward model processes $s_t$ to be a lower dimensional vector $\psi(s_t)$, concatenates it with $\hat{a}_t$ from the inverse model, and then up-samples it to get the original image. Brandon Amos describes up-sampling in his excellent blog post. (Note: don’t call it “deconvolution.”)

Now how do we extend this for multi-step trajectories? The solution is simple: make the inverse model a recurrent neural network. That’s it. The model still predicts $\hat{a}_t$ and we use the same loss function (summing across time steps) and the same forward model. For the RNN, the convolutional layers $\phi$ take in the current state, and they always also take in $s_g$, the goal state. They also take in $h_{i-1}$ and $a_{i-1}$, the previous hidden state and the previous action (the actual action, not the predicted one; using predictions would be a bit silly when we have the ground truth).

The multi-step trajectory case, visualizing several steps out of many.

Thoughts:

• Why not make the forward model recurrent?

• Should we weigh shorter-term actions highly instead of summing everything equally as they appear to be doing?

• How do we actually decide the length of the action vector to predict? Or said in a better way, when do we decide that we’ve attained $s_g$?

Fortunately, the authors answer that last thought by training a deep neural network that can learn a stopping criterion. They say:

We sample states at random, and for every sampled state make positives of its temporal neighbors, and make negatives of the remaining states more distant than a certain margin. We optimize our goal classifier by cross-entropy loss.

So, states “close” to each other are positive samples, whereas “farther” states are negative. Sure, that makes sense. By distance I assume simple Euclidean distance on raw pixels? I’m generally skeptical of Euclidean distance, but it might be necessary if the forward model also optimizes the same objective. I also assume this is applied after each time step, testing whether $s_i$ at time $i$ has reached $s_g$. Thus, it is not known ahead of time how many actions the RNN must be able to predict before the goal is reset.
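The labeling scheme they describe can be sketched in a few lines. The window and margin values below are placeholders of my own, and I use temporal index distance for clarity; the actual notion of “neighbor” and “margin” in the paper may differ.

```python
def label_pairs(num_states, anchor, neighbor_window=2, margin=5):
    """Self-supervised labels for the goal classifier: states temporally
    near a sampled anchor state are positives, states beyond a margin are
    negatives, and states in between are left unlabeled.

    Returns (positive_indices, negative_indices) for one sampled anchor."""
    positives, negatives = [], []
    for i in range(num_states):
        dist = abs(i - anchor)
        if 0 < dist <= neighbor_window:
            positives.append(i)
        elif dist > margin:
            negatives.append(i)
        # states between the window and the margin stay unlabeled
    return positives, negatives
```

A binary classifier trained with cross-entropy on these pairs then serves as the stopping criterion: at each step, test whether the current state is classified as “at the goal.”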

An alternative is mentioned about treating stopping as an action. There’s some resemblance to this and DDO’s option termination criterion.

Additionally, we have this relevant comment on OpenReview:

The independent goal recognition network does not require any extra work concerning data or supervision. The data used to train the goal recognition network is the same as the data used to train the PSF. The only prior we are assuming is that nearby states to the randomly selected states are positive and far away are negative which is not domain specific. This prior provides supervision for obtaining positive and negative data points for training the goal classifier. Note that, no human supervision or any particular form of data is required in this self-supervised process.

Yes, this makes sense.

Now let’s discuss the experiments. The authors test several ablations of their model:

• An inverse model with no forward model at all (Nair et al., 2017). This is different from their earlier paper which used a forward model for regularization purposes (Agrawal et al., 2016). The model in (Nair et al., 2017) just used the inverse model for predicting an action given current image $I_t$ and (critically!) a goal image $I_{t+1}'$ specified by a human.

• A more sophisticated inverse model with an RNN, but no forward model. Think of my most recent hand-drawn figure above, except without the forward portion. Furthermore, this baseline also does not use the action $a_i$ as input to the RNN structure.

• An even more sophisticated model where the action history is now input to the RNN. Otherwise, it is the same as the one I just described above.

Thus, all three of their ablations do not use the forward consistency model and are solely trained by minimizing $\mathcal{L}(a_t,\hat{a}_t)$. I suppose this is reasonable, and to be fair, testing these out in physical trials takes a while. (Training should be less cumbersome because data collection is the bottleneck. Once they have data, they can train all of their ablations quickly.) Finally, note that all these inverse models take $(s_t,s_g)$ as input, and $s_g$ is not necessarily $s_{t+1}$. This, I remember from the greedy planner in (Agrawal et al., 2016).

The experiments are: navigating a short mobile robot throughout rooms and performing rope manipulation with the same setup from (Nair et al., 2017).

• Indoor navigation. They show the model an image of the target goal, and check if the robot can use it to arrive there. This obviously works best when few actions are needed; otherwise, waypoints are necessary. However, for results to be interesting enough, the target image should not have any overlap with the starting image.

The actions are: (1) forward 10cm, (2) turn left, (3) turn right, and (4) standing still. They use several “tricks” such as using action repeats, applying a reset maneuver, etc. A ResNet acts as the image processing pipeline, and then (I assume) the ResNet output is fed into the RNN along with the hidden layer and action vector.

Indeed, it seems like their navigating robot can reach goal states and is better than the baselines! They claim their robot learns first to turn and then to move to the target. To make results more impressive, they tested all this on a different floor from where the training data was collected. Nice! The main downside is that they conducted only eight trials for each method, which might not be enough to be entirely convincing.

Another set of experiments tests imitation learning, where the goal images are far away from the robot, thus mandating a series of checkpoint images specified by a human. Every fifth image in a human demonstration was provided as a waypoint. (Note: this doesn’t mean the robot will take exactly five steps for each waypoint even if it was well trained, because it may take four or six or some other number of actions before it deems itself close enough to the target.) Unfortunately, I have a similar complaint as earlier: I wish there were more than just three trials.

• Rope manipulation. They claim almost a 2x performance boost over (Nair et al., 2017) while using the same training data of 60K-70K interaction pairs. That’s the benefit of building upon prior work. They surprisingly never say how many trials they have, and their table reports only a “bootstrapped standard deviation”. Looking at (Nair et al., 2017), I cannot find where the 35.8% figure comes from (I see 38% in that paper but that’s not 35.8%…).

According to OpenReview comments they also trained the model from (Agrawal et al., 2016) and claim 44% accuracy. This needs to be in the final version of the paper. The difference from (Nair et al., 2017) is that (Agrawal et al., 2016) jointly train a forward model (not to enforce dynamics, but as a regularizer), while (Nair et al., 2017) do not have any forward model.

Despite the lack of detail in some areas of the paper (where’s the appendix?!?), I certainly enjoyed reading it and would like to try out some of this stuff.

# A Critical Comparison of Three Half Marathons I Have Run

I have now run in three half marathons: the Berkeley Half Marathon (November 2017), the Kaiser Permanente San Francisco Half Marathon (February 2018), and the Oakland Half Marathon (March 2018).

To be clear, the Kaiser Permanente San Francisco half marathon is not the same as a separate set of San Francisco races in the summers. The Oakland Half Marathon is also technically the “Kaiser Permanente […]” but since there’s only one main set of Oakland races a year — known as the “Running Festival” — we can be more lenient in our naming convention.

All these races are popular, and the routes are relatively flat and therefore great for setting PRs. I would be happy to run any of these again. In fact, I probably will, for all three!

In this post, I’ll provide some brief comments on each of the races. Note that:

• When I list registration fees, it’s not always a clear-cut comparison since prices jack up closer to race day. I think I managed to get an “early bird” deal for all these races, so hopefully the prices are somewhat comparable. Also, I include taxes in the fee I list.

• By “packet pickup” I refer to when runners pick up whatever racing material is needed (typically a timing chip, bib, sometimes gear as well) a day or two before the actual race. These pickup events also involve some deals for food and running equipment from race sponsors. Below is a picture that I took of the Oakland package pickup:

• While I list “pros” and “cons” of the races, most are minor in the grand scheme of things, and this review is for those who might be picky. I reiterate that I will probably run in all of these again the next time around.

OK, let’s get started!

## Berkeley Half Marathon

• Website: here.
• Price I paid: about $100, including a $10 bib shipping fee.

Pros:

• The race has a great “local feel” to it, with lots of Berkeley students and residents both running in the race or cheering us as spectators. I saw a number of people that I knew, mostly other student runners, and it was nice to say hi to them. There was also a cool drumming band which played while we were entering the portion of the race close to the San Francisco Bay.

• The course is mostly flat, and enters a few Berkeley neighborhoods (again, a great local feel to it). There’s also a relatively straight section, roughly in the 8–11 mile range by the San Francisco Bay, that lets you see the runners ahead of you as you enter it (for extra motivation). As I discussed two years ago, I regularly run by this area so I was used to the view, but I can see it being attractive for those who don’t use the same routes.

• There are lots of pacers, for half-marathon finish times of 1:27, 1:35 (2x), 1:45, 1:55, etc.

• The post-race food sampling selection was fantastic! There were the obligatory water bottles and bananas, but I also had tasty Power Crunch protein bars, Muscle Milk (this is clearly bad for you, but never mind), pretzels, cookies, coffee, etc. There was also beer, but I didn’t have any.

• Post-race deals are excellent. I used them to order some Power Crunch bars at a discount.

• The packet pickup had some decent free food samples. The race shirt is interesting — it’s a different style from prior years and feels somewhat odd but I surprisingly like it, and I’ll be wearing it both to school and for when I run in my own time.

Cons:

• There’s a $10 bib mailing fee, and I realize now that it’s pointless to pay for it because we also have to pick up a timing chip during packet pickup, and that’s when we could have gotten the bibs. Thus, there seems to be no advantage to paying for the bib to be mailed. Furthermore, I wish the timing chip were attached to the bib; we had to tie it within our shoelaces. I think it’s far easier to stick it on the bib.

• The starting location is a bit awkwardly placed in the center of the city, though to be fair, I’m not sure of a better spot. Certainly it’s less convenient for drop-offs and Uber rides compared to, say, Golden Gate Park.

• There were seven water stops, one of which had electrolytes and GU energy chews. (Unfortunately, when running, I actually dropped two out of the four GU chews I was given … please use the longer, thinner packages that the Oakland race uses!!) The other two races offered richer goodies at the aid stations, so next time I’ll bring my own energy stuff.

• It was the most expensive of the races I’ve run in, though the difference isn’t that much, especially if you avoid making the mistake of getting your bib mailed to you.

• The photography selection after the race is excellent, but it’s expensive and most of it is concentrated near the end of the race when it’s crowded, so most pictures weren’t that interesting.

## Kaiser Permanente San Francisco Half Marathon

• Website: here.
• Price I paid: about $80.

Upsides:

• The race route is great! I enjoyed running through Golden Gate Park and seeing the Japanese Tea Garden, the California Academy of Sciences, and so on. There’s also a very long, straight section in the second half of the race (longer than Berkeley’s!) by the ocean where you can again see the runners ahead of you on their way back.

• There’s a great selection of post-race sampling, arguably on par with Berkeley though there’s no beer. There were water bottles and bananas, along with CLIF Whey protein bars, Ocho candy, some coffee/caffeine-based drinks, etc.

• The price is the cheapest of the three, which is surprising since I figured things in San Francisco would be more expensive. I suspect it has to do with much of the race being in Golden Gate Park, and the course is set so that there isn’t a need to close many roads. On a related note, it’s also easy to drop off and pick up racers.

• You have to finish the race to get your shirt. Of course this is minor, but I believe it’s not a good idea to wear the official race shirt on race day. Incidentally, there’s no packet pickup, which means we don’t get free samples or deals, but it’s probably better for me since I would have had to Uber a long distance there and back. You get the bib and timing chip mailed in advance, and the timing chip is (thankfully) attached to the bib.

Downsides:

• No pacers. I don’t normally try to stick to a pacer during my races, but I think they’re useful.

• While there was a great selection of post-race food sampling, there was no beer offered, in contrast to the Berkeley and Oakland races.

• With regards to post-race photographs, my comments on this are basically identical to those of the Berkeley race.

• All the aid stations had electrolytes (I think Nuun) in addition to water. It was a bit unclear to me which cups corresponded to which beverage, though in retrospect I should have realized that the “blank” cups had water and the cups with a lightning sign on them had the electrolytes. The drinks situation is better than the Berkeley race, but the downside is that there were no GU energy chews, so perhaps it’s a wash with respect to the aid stations?

• It felt like there were fewer people cheering us on when we raced, particularly compared to the Berkeley race.

• I don’t think there were as many post-race discount deals. I was hoping that there were some deals for the CLIF whey protein bars, which would have been the analogue of the Power Crunch discount for the Berkeley race. The discount deals also lasted only a week, compared to two months for Berkeley’s post-race stuff.

## Oakland Running Festival Half Marathon

• Website: here.
• Price I paid: about \$90.

Upsides:

• The race started at 9:45am, whereas the Berkeley and San Francisco races each started at about 8:10am. While I consider myself a morning person, that’s for work. If I want to set a half marathon PR, a 9:45am starting time is far better.

• The Oakland race easily has the best aid stations compared to the other two races. Not only were there electrolytes at each station, but some also had bananas, GU gels, and GU chews (yes, GU has a lot of products!). Throughout the race I consumed two half-bananas (easy to eat since you can squeeze them), one GU gel, and one GU chew package, which contained about eight chews. This was very helpful!

• There were lots of spectators and locals cheering us on, possibly as many as the Berkeley race had.

• The view of Lake Merritt is excellent, and it’s probably the main visual attraction. Other than that, the race mostly passes through Oakland’s business district. Also, this was the only one of the three races where a marathon was simultaneously offered, so there were a few marathoners mixed in with us.

• There’s a great packet pickup (which I showed a photo of earlier), which probably had as many deals as the Berkeley packet pickup. We had to show up to the pickup to get the bib and the timing chip (attached to the bib). While I was there, I bought several GU products that I’ll use for my future long-distance training sessions.

• Each runner got tickets for two free Lagunitas Beer cups, offered after the race, but one was enough for me. I’m not sure how people can down two servings quickly.

• There were pacers for various distances.

• Race photos are free, which is definitely refreshing compared to the other two races. Disclaimer: I’m writing this post one day after the race occurred, and I won’t be able to download the photos for a few days, so the quality may be worse on a per-photo basis.

• Unfortunately, I don’t think there are any post-race deals. Hopefully something will show up in my inbox soon so I can turn this into an “upside.” Update 03/27/2018: heh, a day later, I got an email in my inbox showing that there are some race deals. Excellent! The deals seem to be just as good as the other races’, so I’ll put it as an upside.

Downsides:

• The race scenery is probably less appealing than the Berkeley or San Francisco races. The route mostly weaves throughout the city roads, and there aren’t clear views of the Bay. Also, the turn near the end of the race when we see Lake Merritt again is narrow and awkwardly placed, and it’s also hilly, which is not what I want to see at the 12th and 13th mile checkpoints.

• The post-race food sampling was probably weaker compared to the other two, though it’s debatable. There were water bottles, as you can see in my photo below, along with bananas and some peanut butter bars and energy drinks. I think the other races had more, and I was disappointed when the Oakland website said that racers would “receive bagels” because I didn’t see any! On the positive side, I got a free package of GU stroopwafel, so again, it’s debatable.

• The race isn’t as good at storing your sweats. At Berkeley, we could leave our sweats in the Berkeley high school gym, and it was easy for us to retrieve our bags after the race. At Oakland, they were stored in a small tent and we had to stand in line for a while before a volunteer could find our stuff.

The finish line of the Oakland races (including the half marathon).

## Conclusion

I’m really happy that I started running half marathons. I’m signed up to run the San Francisco Second Half-Marathon in July. If you’re interested in training with me, let me know.