Inverse Reinforcement Learning from Preferences
It’s been a long time since I engaged in a detailed read-through of an inverse reinforcement learning (IRL) paper. The idea is that, rather than the standard reinforcement learning setup, where an agent explores to get samples and finds a policy that maximizes the expected sum of discounted rewards, we are instead given data up front and must determine the reward function. Once this reward function is learned, one can learn a new policy by running standard reinforcement learning, but with the reward for each state (or state-action) determined by the learned reward function. As a side note, since this appears to be quite common and “part of” IRL, I’m not sure why IRL is often classified as an “imitation learning” algorithm when reinforcement learning has to be run as a subroutine. Keep this in mind when reading papers on imitation learning, which often categorize algorithms as supervised learning (e.g., behavioral cloning) approaches versus IRL approaches, such as in the introduction of the famous Generative Adversarial Imitation Learning paper.
In the rest of this post, we’ll cover two closely related works on IRL that cleverly and effectively rely on preference rankings among trajectories. They also have similar acronyms: T-REX and D-REX. The T-REX paper presents the Trajectory-ranked Reward EXtrapolation algorithm, which is also used in the D-REX paper (Disturbance-based Reward EXtrapolation). So we shall first discuss how reward extrapolation works in T-REX, and then clarify the difference between the two papers.
T-REX and D-REX
The motivation for T-REX is that most IRL approaches rely on finding a reward function that explains the demonstrator data and makes it appear optimal. But what if the demonstrator data is suboptimal? Then, rather than fit a reward function to this data, it may be better to figure out the features of the data that convey information about the underlying intentions of the demonstrator, which may be extrapolated beyond the data. T-REX does this by working with a set of ranked demonstrations.
To be concrete, denote a sequence of $m$ ranked trajectories:
\[\mathcal{D} = \{ \tau_1, \ldots, \tau_m \}\]where if $i<j$, then $\tau_i \prec \tau_j$, or in other words, trajectory $\tau_i$ is worse than $\tau_j$. We’ll assume that each $\tau_i$ consists of a series of states, so that neither demonstrator actions nor the reward are needed (a huge plus!):
\[\tau_i = (s_0^{(i)}, s_1^{(i)}, \ldots, s_T^{(i)})\]and we can also assume that the trajectory lengths are all the same; this isn’t a strict requirement of T-REX (since we can normalize based on length), but it probably makes training more numerically stable.
From this data $\mathcal{D}$, T-REX will train a learned reward function $\hat{R}_\theta(s)$ such that:
\[\sum_{s \in \tau_i} \hat{R}_\theta(s) < \sum_{s \in \tau_j} \hat{R}_\theta(s) \quad \mbox{if} \quad \tau_i \prec \tau_j\]To be clear, in the above equation there is no true environment reward at all. It’s just the learned reward function $\hat{R}_\theta$, along with the trajectory rankings. That’s it! One may, of course, use the true reward function to determine the rankings in the first place, but that is not required, and that’s a key flexibility advantage of T-REX – there are many other ways to rank trajectories.
In order to train $\hat{R}_\theta$ so that the above criterion is satisfied, we can use the cross-entropy loss function. Most people probably first encounter the cross-entropy loss in the context of classification tasks, where the neural network outputs some “logits” and the loss function tries to “get” the logits to match a true one-hot vector distribution. The logic here is similar. The output of the reward network forms the (unnormalized) probability that one trajectory is preferable to another:
\[P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)) \approx \frac{\exp \sum_{s \in \tau_j} \hat{R}_\theta(s) }{ \exp \sum_{s \in \tau_i}\hat{R}_\theta(s) + \exp \sum_{s \in \tau_j}\hat{R}_\theta(s) }\]which we then use in this loss function:
\[\mathcal{L}(\theta) = - \sum_{\tau_i \prec \tau_j } \log \left( \frac{\exp \sum_{s \in \tau_j} \hat{R}_\theta(s) }{\exp \sum_{s \in \tau_i} \hat{R}_\theta(s)+ \exp \sum_{s \in \tau_j}\hat{R}_\theta(s) } \right)\]Let’s deconstruct what we’re looking at here. The loss function $\mathcal{L}(\theta)$ for training $\hat{R}_\theta$ is binary cross entropy, where the two “classes” involved here are whether $\tau_i \succ \tau_j$ or $\tau_i \prec \tau_j$. (We can easily extend this to include cases when the two are equal, but let’s ignore that for now.) Above, the true class corresponds to $\tau_i \prec \tau_j$.
If this isn’t clear, then it’s worth reviewing the cross entropy (e.g., from this source): between a true distribution “$p$” and a predicted distribution “$q$”, it is defined as $-\sum_x p(x) \log q(x)$, where the sum over $x$ iterates through all possible classes – in this case we only have two. The true distribution is $p=[0,1]$ if we interpret the two components as expressing the class $\tau_i \succ \tau_j$ at index 0, or $\tau_i \prec \tau_j$ at index 1. In all cases, the “class” we assign is index 1 by design. The predicted distribution comes from the output of the reward function network:
\[q = \Big[1 - P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)), \; P(\hat{J}_\theta(\tau_i) < \hat{J}_\theta(\tau_j)) \Big]\]and putting this together, the cross entropy term reduces to $\mathcal{L}(\theta)$ as shown above, for a single training data point (i.e., a single training pair $(\tau_i, \tau_j)$). We would then sample many of these pairs during training for each minibatch.
To handle cases when the ranking between two trajectories is ambiguous, one can set the “target” distribution to be $[0.5, 0.5]$. This is made explicit in this NeurIPS 2018 paper from DeepMind, which uses the same loss function.
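To make this concrete, here is a minimal NumPy sketch of the loss for a single $(\tau_i, \tau_j)$ pair, assuming the per-state outputs of $\hat{R}_\theta$ have already been computed (the function name and interface are mine, not from the T-REX code):

```python
import numpy as np

def trex_pair_loss(rewards_i, rewards_j):
    """Cross-entropy ranking loss for one trajectory pair, where
    trajectory j is preferred over trajectory i (tau_i < tau_j).

    rewards_i, rewards_j: per-state outputs of the learned reward
    network R_theta along each trajectory, as 1-D arrays.
    """
    # Predicted "returns": sums of learned rewards over each trajectory.
    ret_i = np.sum(rewards_i)
    ret_j = np.sum(rewards_j)
    # Softmax probability that tau_j is preferred, computed stably:
    # log P = ret_j - log(exp(ret_i) + exp(ret_j)).
    m = max(ret_i, ret_j)
    log_p_j = (ret_j - m) - np.log(np.exp(ret_i - m) + np.exp(ret_j - m))
    # Negative log-likelihood of the correct ranking.
    return -log_p_j
```

When the two predicted returns are equal the loss is $\log 2$, and it shrinks toward zero as the preferred trajectory’s predicted return pulls ahead of the other.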
The main takeaway is that this process will learn a reward function assigning greater total return to higher ranked trajectories. As long as there are features associated with higher return that are identifiable from the data, then it may be possible to extrapolate beyond the data.
Once the reward function is learned, T-REX runs policy optimization with standard reinforcement learning, which in both papers here is Proximal Policy Optimization. This is done in an online fashion, but instead of the data coming in as $(s,a,r,s')$ tuples, it comes in as $(s,a,\hat{R}_\theta(s),s')$ tuples, where the reward comes from the learned reward function.
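As a sketch of what this relabeling looks like, here is a hypothetical helper (not from either paper’s code) that swaps the environment reward out of a transition before it reaches the RL algorithm:

```python
def relabel_transition(s, a, s_next, reward_fn):
    """Swap the environment reward for the learned one.

    `reward_fn` stands in for the trained network R_theta (a callable
    mapping a state to a scalar); the tuple layout matches the
    (s, a, R_theta(s), s') convention described above.
    """
    return (s, a, reward_fn(s), s_next)
```

During online collection, every transition would be passed through something like this before being stored in PPO’s rollout buffer, so the policy optimizer never sees the true environment reward.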
This makes sense, but as usual, there are a bunch of practical tips and tricks to get things working. Here are some for T-REX:

For many environments, “trajectories” often refer to “episodes,” but these can last for a large number of time steps. To perform data augmentation, one can subsample snippets of the same length from each trajectory in a ranked pair $\tau_i$ and $\tau_j$.

Training an ensemble of reward functions for $\hat{R}_\theta$ often helps, provided the individual components have values at roughly the same scale.

The reward used for the policy optimization stage might need some extra “massaging” to it. For example, with MuJoCo, the authors use a control penalty term that gets added to $\hat{R}_\theta(s)$.

To check whether reward extrapolation is feasible, one can plot ground truth returns on the x-axis and predicted returns on the y-axis. If there is strong correlation between the two, that’s a sign extrapolation is more likely to happen.
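The subsampling trick from the first bullet point can be sketched as follows (my own simplification; the actual T-REX implementation has additional details about where snippets may start within each trajectory):

```python
import random

def subsample_pair(traj_i, traj_j, length, rng=random):
    """Draw one equal-length snippet from each trajectory of a ranked
    pair (traj_i ranked below traj_j); each call yields a fresh training
    pair for the ranking loss. Both trajectories are lists of states.
    """
    start_i = rng.randrange(len(traj_i) - length + 1)
    start_j = rng.randrange(len(traj_j) - length + 1)
    return (traj_i[start_i:start_i + length],
            traj_j[start_j:start_j + length])
```

Because each call picks fresh start indices, one pair of long episodes can supply many distinct training pairs for the ranking loss.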
In both T-REX and D-REX, the authors experiment with discrete control and continuous control using standard environments from Atari and MuJoCo, respectively. They find that, overall, their two-stage approach of (1) finding $\hat{R}_\theta$ from preferences and (2) running PPO on top of this learned reward function works better than competing baselines such as Behavior Cloning and Generative Adversarial Imitation Learning, and that it can exceed the performance of the demonstration data.
The above is common to both T-REX and D-REX. So what’s the difference between the two papers?

T-REX assumes that we have rankings available ahead of time. These can come from a number of sources. Maybe they are “ground truth” rankings based on ground truth rewards (i.e., just sum up the true reward within each $\tau_i$), or they might be noisy rankings. An easy way to test noisy rankings is to extract trajectories from an RL agent’s training history and rank them by when they occurred during training. Another, more cumbersome way (since it relies on human subjects) is to use Amazon Mechanical Turk. The T-REX paper does a splendid job testing these different rankings – it’s one reason I really like the paper.

In contrast, D-REX assumes these rankings are not available ahead of time. Instead, the approach involves training a policy from the provided demonstration data via Behavior Cloning, then taking the resulting snapshot and rolling it out in the environment with different noise levels. This naturally provides a ranking for the data, and only relies on the weak assumption that the Behavior Cloning agent will be better than a purely random policy. Then, with these automatic rankings, D-REX can do exactly what T-REX did!
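A rough sketch of this noise-injection procedure for a discrete-action environment might look like the following. All function names here are placeholders (not the paper’s code), and I use simple epsilon-greedy noise as the disturbance, which is one natural way to realize "different noise levels":

```python
import random

def drex_ranked_rollouts(bc_action, env_step, reset, noise_levels,
                         horizon, n_actions, rng=random):
    """Sketch of D-REX-style automatic ranking via noise injection.

    `bc_action` is the behavior-cloned policy, `env_step(s, a)` is a
    deterministic transition function, and `reset()` gives a start
    state (all placeholders). With probability eps we take a uniformly
    random action; more noise is assumed to yield a worse trajectory.
    """
    ranked = []
    for eps in sorted(noise_levels, reverse=True):  # most noise (worst) first
        s = reset()
        traj = [s]
        for _ in range(horizon):
            if rng.random() < eps:
                a = rng.randrange(n_actions)  # random action with prob. eps
            else:
                a = bc_action(s)              # behavior-cloned action
            s = env_step(s, a)
            traj.append(s)
        ranked.append(traj)  # ranked[k] is preferred over ranked[k-1]
    return ranked
```

The returned list is ordered from most-noisy to least-noisy rollout, which is exactly the ranked-trajectory format that the T-REX reward-learning stage consumes.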

D-REX makes a second contribution on the theoretical side to better understand why preferences over demonstrations can reduce reward function ambiguity in IRL.
Some Theory in D-REX
Here’s a little more on the theory from D-REX. We’ll follow the notation from the paper and state Theorem 1 here (see the paper for context):
If the estimated reward function is $\hat{R}(s) = w^T\phi(s)$, the true reward function is \(R^*(s) = \hat{R}(s) + \epsilon(s)\) for some error function \(\epsilon : \mathcal{S} \to \mathbb{R}\), and \(\|w\|_1 \le 1\), then extrapolation beyond the demonstrator, i.e., \(J(\hat{\pi} \mid R^*) > J(\mathcal{D} \mid R^*)\), is guaranteed if:
\[J(\pi_{R^*}^* \mid R^*) - J(\mathcal{D} \mid R^*) > \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma}\]where \(\pi_{R^*}^*\) is the optimal policy under $R^*$, \(\epsilon_\Phi = \| \Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}} \|_\infty\), and \(\|\epsilon\|_\infty = \sup\{ |\epsilon(s)| : s \in \mathcal{S} \}\).
To clarify the theorem, $\hat{\pi}$ is some learned policy which we want to outperform the average episodic return in the demonstration data, $J(\mathcal{D} \mid R^*)$. We begin by considering the difference in return between the optimal policy under the true reward (which can’t be exceeded w.r.t. that reward, by definition) and the expected return of the learned policy (also under that true reward):
\[\begin{align} J(\pi_{R^*}^* \mid R^*) - J(\hat{\pi} \mid R^*) \;&\overset{(i)}{=}\; \left| \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t R^*(s_t) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t R^*(s_t) \Big] \right| \\ \;&\overset{(ii)}{=}\; \left| \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t (w^T\phi(s_t)+\epsilon(s_t)) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t (w^T\phi(s_t)+\epsilon(s_t)) \Big] \right| \\ \;&\overset{(iii)}{=}\; \left| w^T\Phi_{\pi_{R^*}^*} + \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t \epsilon(s_t) \Big] - w^T\Phi_{\hat{\pi}} - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t \epsilon(s_t) \Big] \right| \\ \;&\overset{(iv)}{\le}\; \left| w^T(\Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}) + \mathbb{E}_{\pi_{R^*}^*} \Big[ \sum_{t=0}^\infty \gamma^t \sup_{s\in \mathcal{S}} \epsilon(s) \Big] - \mathbb{E}_{\hat{\pi}} \Big[ \sum_{t=0}^\infty \gamma^t \inf_{s \in \mathcal{S}} \epsilon(s) \Big] \right| \\ \;&\overset{(v)}{=}\; \left| w^T(\Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}) + \Big( \sup_{s\in \mathcal{S}} \epsilon(s) - \inf_{s \in \mathcal{S}} \epsilon(s) \Big) \sum_{t=0}^{\infty} \gamma^t \right| \\ \;&\overset{(vi)}{\le}\; \left| w^T(\Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}) + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \right| \\ \;&\overset{(vii)}{\le}\; \left| w^T(\Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}) \right| + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \\ \;&\overset{(viii)}{\le}\; \|w\|_1 \|\Phi_{\pi_{R^*}^*} - \Phi_{\hat{\pi}}\|_\infty + \frac{2 \|\epsilon\|_\infty}{1-\gamma} \\ &\overset{(ix)}{\le}\; \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} \end{align}\]where

in (i), we apply the definitions of the terms and put absolute values around the difference. I don’t think this is necessary since the LHS must be nonnegative, but it doesn’t hurt.

in (ii), we substitute $R^*$ with the theorem’s assumption about both the error function and how the estimated reward is a linear combination of features.

in (iii) we move the weights $w$ outside the expectation as they are constants and we can use linearity of expectation. Then we use the paper’s definition of $\Phi_\pi$ as the expected feature counts for given policy $\pi$.

in (iv), we group the two $\Phi$ terms together (notice how this matches the theorem’s $\epsilon_\Phi$ definition), and we then make this an inequality by applying a “sup” and an “inf” inside the respective expectations at each time step. The idea is that if we have $A-B$, we make the $A$ term larger and the $B$ term smaller. Since we’re doing this for an infinite number of time steps, I am somewhat worried that this is a loose bound.

in (v) we see that since the “sup” and “inf” terms no longer depend on $t$, we can move them outside the expectations. In fact, we don’t even need expectations anymore, since all that’s left is a sum over discounted $\gamma$ terms.

in (vi), we apply the geometric series formula to get rid of the sum over $\gamma$, and the inequality then results from replacing the “sup” and “inf” with the \(\|\epsilon\|_\infty\) from the theorem statement – the “2” covers the extremes of a large positive error and a large negative error (note the absolute value in the definition of \(\|\epsilon\|_\infty\); that’s important).

in (vii) we apply the Triangle Inequality.

in (viii) we apply Hölder’s inequality.

finally, in (ix) we apply the theorem statements.
We now take that final inequality and subtract the average demonstration data return from both sides:
\[\underbrace{J(\pi_{R^*}^* \mid R^*) - J(\mathcal{D} \mid R^*)}_{\delta} - J(\hat{\pi} \mid R^*) \le \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\mathcal{D} \mid R^*)\]Now we finally invoke the “if” condition in the theorem. If the condition holds, then we can replace $\delta$ above with the (strictly smaller) right-hand side of that condition, since doing so only reduces the LHS:
\[\epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\hat{\pi} \mid R^*) < \epsilon_\Phi + \frac{2\|\epsilon\|_\infty}{1 - \gamma} - J(\mathcal{D} \mid R^*)\]which implies:
\[- J(\hat{\pi} \mid R^*) < - J(\mathcal{D} \mid R^*) \quad \Longrightarrow \quad J(\hat{\pi} \mid R^*) > J(\mathcal{D} \mid R^*),\]showing that $\hat{\pi}$ has extrapolated beyond the data.
What’s the intuition behind the theorem? The LHS of the theorem condition is the difference between the return of the optimal policy and that of the demonstration data. By definition of optimality, the LHS is at least 0, but it can get very close to 0 if the demonstration data is very good. That makes the condition for outperforming the demonstrator less likely to hold (which makes sense – there is little room left to extrapolate). Focusing on the RHS, we see that its value is larger if the maximum error in $\epsilon$ is large. This might be a very restrictive condition, since it considers the maximum absolute error over the entire state set $\mathcal{S}$. Since there are an infinite number of states in many practical applications, even one large error might cause the inequality in the theorem statement to fail.
The proof also relies on the assumption that the estimated reward function is a linear combination of features (that’s what $\hat{R}(s)=w^T\phi(s)$ means) but $\phi$ could contain arbitrarily complex features, so I guess it’s a weak assumption (which is good), but I am not sure?
Concluding Remarks
Overall, the T-REX and D-REX papers are nice IRL papers that rely on preferences between trajectories. The takeaways I get from these works:

While reinforcement learning may be very exciting, don’t forget about the perhaps lesser-known task of inverse reinforcement learning.

Taking subsamples of trajectories is a helpful way to do data augmentation when doing anything at the granularity of episodes.

Perhaps most importantly, I should understand when and how preference rankings might be applicable and beneficial. In these works, preferences enable training an agent to perform better than the demonstration data without strictly requiring ground truth environment rewards, and potentially without even requiring demonstrator actions (though D-REX requires actions for its Behavior Cloning stage).
I hope you found this post helpful. As always, thank you for reading, and stay safe.
Papers covered in this blog post:

Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations, ICML 2019.

Daniel S. Brown, Wonjoon Goo, Scott Niekum. Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations, CoRL 2019.