OneShot Visual Imitation Learning via MetaLearning
A followup paper to the one I discussed in my previous post is OneShot Visual Imitation Learning via MetaLearning. The idea is, again, to train neural network parameters \(\theta\) on a distribution of tasks such that the parameters are easy to finetune to new tasks sampled from the distribution. In this paper, the focus is on imitation learning from raw pixels and showing the effectiveness of a oneshot imitator on a physical PR2 robot.
Recall that the original MAML paper showed the algorithm applied to supervised regression (for sinusoids), supervised classification (for images), and reinforcement learning (for MuJoCo). This paper shows how to use MAML for imitation learning, and the extension is straightforward. First, each imitation task \(\mathcal{T}_i \sim p(\mathcal{T})\) contains the following information:

A trajectory \(\tau = \{o_1,a_1,\ldots,o_T,a_T\} \sim \pi_i^*\) consists of a sequence of states and actions from an expert policy \(\pi_i^*\). Remember, this is imitation learning, so we can assume an expert. Also, note that the expert policy is taskspecific.

A loss function \(\mathcal{L}(a_{1:T},\hat{a}_{1:T}) \to \mathbb{R}\) providing feedback on how closely our actions match those of the expert’s.
Since the focus of the paper is on “oneshot” learning, we assume we only have one trajectory available for the “inner” gradient update portion of metatraining for each task \(\mathcal{T}_i\). However, if you recall from MAML, we actually need at least one more trajectory for the “outer” gradient portion of metatraining, as we need to compute a “validation error” for each sampled task. This is not the overall metatest time evaluation, which relies on an entirely new task sampled from the distribution (and which only needs one trajectory, not two or more). Yes, the terminology can be confusing. When I refer to “test time evaluation” I always refer to when we have trained \(\theta\) and we are doing fewshot (or oneshot) learning on a new task that was not seen during training.
All the tasks in this paper use continuous control, so the loss function for optimizing our neural network policy \(f_\theta\) can be described as:
\[\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)} \sim p(\mathcal{T}_i)} \sum_{t=1}^T \ f_\theta(o_t^{(j)})  a_t^{(j)} \_2^2\]where the first sum normally has one trajectory only, hence the “oneshot learning” terminology, but we can easily extend it to several sampled trajectories if our task distribution is very challenging. The overall objective is now:
\[{\rm minimize}_\theta \sum_{\mathcal{T}_i\sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \sum_{\mathcal{T}_i\sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} \Big(f_{\theta  \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\Big)\]and one can simply run Adam to update \(\theta\).
This paper uses two new techniques for better performance: a twoheaded architecture, and a bias transformation.

TwoHeaded Architecture. Let \(y_t^{(j)}\) be the vector of postactivation values just before the last fully connected layer which maps to motor torques. The last layer has parameters \(W\) and \(b\), so the inner loss function \(\mathcal{L}_{\mathcal{T}_i}(f_\theta)\) can be rewritten as:
\[\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)} \sim p(\mathcal{T}_i)} \sum_{t=1}^T \ Wy_t^{(j)} + b a_t^{(j)} \_2^2\]where, I suppose, we should write \(\phi = (\theta, W, b)\) and redefine \(\theta\) to be all the parameters used to compute \(y_t^{(j)}\).
In this paper, the testtime single demonstration of the new task is normally provided as a sequence of observations (images) and actions. However, they also experiment with the more challenging case of removing the provided actions for that single testtime demonstration. They simply remove the action and use this inner loss function:
\[\mathcal{L}_{\mathcal{T}_i}(f_\theta) = \sum_{\tau^{(j)} \sim p(\mathcal{T}_i)} \sum_{t=1}^T \ Wy_t^{(j)} + b\_2^2\]This is still a bit confusing to me. I’m not sure why this loss function leads to the desired outcome. It’s also a bit unclear how the twoheaded architecture training works. After another read, maybe only the \(W\) and \(b\) are updated in the inner portion?
The twoheaded architecture seems to be beneficial on the simulated pushing task, with performance improving by about 56 percentage points. That may not sound like a lot, but this was in simulation and they were able to test with 444 total trials.
The other confusing part is that if we assume we’re allowed to have access to expert actions, then the realworld experiment actually used the singleheaded architecture, and not the twoheaded one. So there wasn’t a benefit to the twoheaded one assuming we have actions. Without actions, of course, the twoheaded one is our only option.

Bias Transformation. After a certain neural network layer (which in this paper is after the 2D spatial softmax applied after the convolutions to process the images), they concatenate this vector of parameters. They claim that
[…] the bias transformation increases the representational power of the gradient, without affecting the representation power of the network itself. In our experiments, we found this simple addition to the network made gradientbased metalearning significantly more stable and effective.
However, the paper doesn’t seem to show too much benefit to using the bias transformation. A comparison is reported in the simulated reaching task, with a dimension of 10, but it could be argued that performance is similar without the bias transformation. For the two other experimental domains, I don’t think they reported with and without the bias transformation.
Furthermore, neural networks already have biases. So is there some particular advantage to having more biases packed in one layer, and furthermore, with that layer being the same spot where the robot configuration is concatenated with the processed image (like what people do with selfsupervision)? I wish I understood. The math that they use to justify the gradient representation claim makes sense; I’m just missing a tiny step to figure out its practical significance.
They ran their setups on three experimental domains: simulated reaching, simulated pushing, and (drum roll please) real robotic tasks. For these domains, they seem to have tested up to 5.5K demonstrations for reaching and 8.5K for pushing. For the real robot, they used 1.3K demonstrations (ouch, I wonder how long that took!). The results certainly seem impressive, and I agree that this paper is a step towards generalist robots.