Seita's Place

This is my blog, where I have written over 225 articles on a variety of topics, most of which are about one of two major themes. The first is computer science, which is my area of specialty as a Ph.D. student at UC Berkeley. The second can be broadly categorized as "deafness," which relates to my experience and knowledge of being deaf.
https://danieltakeshi.github.io/
Tue, 28 Mar 2017 15:31:55 +0000
<h1>Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients</h1>
<p>As I stated <a href="https://danieltakeshi.github.io/2017/03/23/keeping-track-of-research-articles-my-paper-notes-repository/">in my last blog post</a>, I am feverishly trying to read more
research papers. One category of papers that seems to be coming up a lot
recently is <em>policy gradients</em>, a popular class of
reinforcement learning algorithms that estimate a gradient for a function
approximator. Thus, the purpose of this blog post is for me to explicitly write
the mathematical foundations for policy gradients so that I can gain
understanding. In turn, I hope some of my explanations will be useful to a
broader audience of AI students.</p>
<h1 id="assumptions-and-problem-statement">Assumptions and Problem Statement</h1>
<p>In any type of research domain, we always have to make some set of assumptions.
(By “we”, I refer to the researchers who write papers on this.) With
reinforcement learning and policy gradients, the assumptions usually mean the
<strong>episodic</strong> setting where an agent engages in multiple <strong>trajectories</strong> in its
environment. As an example, an agent could be playing a game of Pong, so one
episode or trajectory consists of a full start-to-finish game.</p>
<p>We define a trajectory <script type="math/tex">\tau</script> of length <script type="math/tex">T</script> as</p>
<script type="math/tex; mode=display">\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)</script>
<p>where <script type="math/tex">s_0</script> comes from the starting distribution of states, <script type="math/tex">a_i \sim
\pi_\theta(a_i| s_i)</script>, and <script type="math/tex">s_i \sim P(s_i | s_{i-1},a_{i-1})</script> with <script type="math/tex">P</script> the
dynamics model (i.e. how the environment changes). We actually <em>ignore</em> the
dynamics when optimizing, since all we care about is getting a good gradient
signal for <script type="math/tex">\pi_\theta</script> to make it better. If this isn’t clear now, it will be
clear soon. Also, the reward can be computed from the states and actions, since
it’s usually a function of <script type="math/tex">(s_i,a_i,s_{i+1})</script>, so it’s not technically needed
in the trajectory.</p>
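<p>To make the trajectory definition concrete, here’s a small sketch that samples one trajectory from a made-up tabular MDP. The policy, dynamics, and reward function below are arbitrary placeholders of my own choosing; only the <em>structure</em> of the rollout matches the definition above.</p>

```python
import numpy as np

def sample_trajectory(T=5, n_states=3, n_actions=2, seed=0):
    """Roll out one trajectory (s_0, a_0, r_0, ..., s_T) in a toy tabular MDP.

    The policy pi, dynamics P, and reward are invented for illustration;
    only the trajectory structure matches the definition in the post.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary fixed dynamics and policy tables (each row is a distribution).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    s = rng.integers(n_states)                    # s_0 from starting distribution
    traj = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])        # a_t ~ pi_theta(. | s_t)
        s_next = rng.choice(n_states, p=P[s, a])  # s_{t+1} ~ P(. | s_t, a_t)
        r = float(s_next == 0)                    # reward computed from (s_t, a_t, s_{t+1})
        traj.append((s, a, r))
        s = s_next
    traj.append((s,))                             # final state s_T has no action or reward
    return traj

traj = sample_trajectory()
```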
<p>What’s our <em>goal</em> here with policy gradients? Unlike algorithms such as DQN,
which strive to find an excellent policy indirectly through Q-values, policy
gradients perform a <em>direct</em> gradient update on a policy to change its
parameters, which is what makes it so appealing. Formally, we have:</p>
<script type="math/tex; mode=display">{\rm maximize}_{\theta}\; \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\gamma^t r_t\right]</script>
<ul>
<li>
<p><strong>Note I</strong>: I put <script type="math/tex">\pi_{\theta}</script> under the expectation. This means the
rewards are computed from a trajectory which was generated under the policy
<script type="math/tex">\pi_\theta</script>. We have to <em>find</em> “optimal” settings of <script type="math/tex">\theta</script> to make
this work.</p>
</li>
<li>
<p><strong>Note II</strong>: we don’t need to optimize the expected sum of discounted rewards,
though it’s the formulation I’m most used to. Alternatives include ignoring
<script type="math/tex">\gamma</script> by setting it to one, extending <script type="math/tex">T</script> to infinity if the episodes
are infinite-horizon, and so on.</p>
</li>
</ul>
<p>The above raises the all-important question: <em>how do we find the best
<script type="math/tex">\theta</script></em>? If you’ve taken optimization classes before, you should know the
answer already: perform gradient ascent on <script type="math/tex">\theta</script>, so we have <script type="math/tex">\theta
\leftarrow \theta + \alpha \nabla_\theta f(\theta)</script> where <script type="math/tex">f(\theta)</script> is the function being
optimized. Here, that’s the expected value of whatever sum of rewards formula
we’re using.</p>
<h1 id="two-steps-log-derivative-trick-and-determining-log-probability">Two Steps: Log-Derivative Trick and Determining Log Probability</h1>
<p>Before getting to the computation of the gradient, let’s first review two
mathematical facts which will be used later, and which are also of independent interest.
The first is the “log-derivative” trick, which tells us how to insert a log into
an expectation when starting from <script type="math/tex">\nabla_\theta \mathbb{E}[f(x)]</script>.
Specifically, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}[f(x)] &= \nabla_\theta \int p_\theta(x)f(x)dx \\
&= \int \frac{p_\theta(x)}{p_\theta(x)} \nabla_\theta p_\theta(x)f(x)dx \\
&= \int p_\theta(x)\nabla_\theta \log p_\theta(x)f(x)dx \\
&= \mathbb{E}\Big[f(x)\nabla_\theta \log p_\theta(x)\Big]
\end{align} %]]></script>
<p>where <script type="math/tex">p_\theta</script> is the density of <script type="math/tex">x</script>. Most of these steps should be
straightforward. The main technical detail to worry about is exchanging the
gradient with the integral. I have never been comfortable in knowing when we are
allowed to do this or not, but since everyone else does this, I will follow
them.</p>
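<p>The log-derivative trick is easy to sanity-check numerically. As a toy example (my own choice, not from any paper), take <script type="math/tex">x \sim \mathcal{N}(\theta, 1)</script> and <script type="math/tex">f(x) = x^2</script>, so the true gradient is <script type="math/tex">\nabla_\theta \mathbb{E}[x^2] = \nabla_\theta (\theta^2 + 1) = 2\theta</script>, while the score is <script type="math/tex">\nabla_\theta \log p_\theta(x) = x - \theta</script>:</p>

```python
import numpy as np

# Score-function check: for x ~ N(theta, 1) and f(x) = x^2,
# E[f(x)] = theta^2 + 1, so grad_theta E[f(x)] = 2 * theta.
# The log-derivative trick says this also equals
# E[f(x) * grad_theta log p_theta(x)], and for a unit-variance
# Gaussian the score is grad_theta log p_theta(x) = (x - theta).
theta = 1.5
rng = np.random.default_rng(0)
x = rng.normal(theta, 1.0, size=500_000)
score_estimate = np.mean(x**2 * (x - theta))  # Monte Carlo estimate of the gradient
exact = 2.0 * theta
```

With enough samples the Monte Carlo estimate lands close to <script type="math/tex">2\theta = 3</script>, which is exactly the exchange-of-gradient-and-integral step working out in practice.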
<p>Another technical detail we will need is the gradient of the log probability of
a <em>trajectory</em>, since we will later replace the <script type="math/tex">x</script> from above with a trajectory
<script type="math/tex">\tau</script>. The computation of <script type="math/tex">\nabla_\theta \log p_\theta(\tau)</script> proceeds as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \log p_\theta(\tau) &= \nabla_\theta \log \left(\mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t)\right) \\
&= \nabla_\theta \left[\log \mu(s_0)+ \sum_{t=0}^{T-1} (\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)) \right]\\
&= \nabla_\theta \sum_{t=0}^{T-1}\log \pi_\theta(a_t|s_t)
\end{align} %]]></script>
<p>The probability of <script type="math/tex">\tau</script> decomposes into a chain of probabilities by the
Markov Decision Process assumption, whereby the next action only depends on the
current state, and the next state only depends on the current state and action.
To be explicit, we use the functions that we already defined: <script type="math/tex">\pi_\theta</script> and
<script type="math/tex">P</script> for the policy and dynamics, respectively. (Here, <script type="math/tex">\mu</script> represents the
starting state distribution.) We also observe that when taking gradients, the
dynamics disappear!</p>
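<p>For a concrete instance, consider a tabular softmax policy with one logit per state-action pair (a setup I’m choosing for illustration). Then <script type="math/tex">\nabla_\theta \log \pi_\theta(a_t|s_t)</script> has a closed form, and summing it over a trajectory never touches the dynamics <script type="math/tex">P</script> at all:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_prob_traj(theta, traj):
    """Sum of grad_theta log pi(a_t | s_t) over a trajectory.

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    traj:  list of (s_t, a_t) pairs. Note that the dynamics P never
    appear here, which is exactly the point: they carry no theta-dependence.
    """
    grad = np.zeros_like(theta)
    for s, a in traj:
        # For softmax logits, the gradient of log pi(a|s) w.r.t. theta[s, :]
        # is onehot(a) - pi(. | s); all other rows of theta get zero.
        pi_s = softmax(theta[s])
        grad[s] -= pi_s
        grad[s, a] += 1.0
    return grad

theta = np.zeros((3, 2))
g = grad_log_prob_traj(theta, [(0, 1), (2, 0), (0, 1)])
```

Each visited state contributes <code>onehot(a) - pi</code> to its row, so every row of the gradient sums to zero, and unvisited states (state 1 here) get no gradient at all.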
<h1 id="computing-the-raw-gradient">Computing the Raw Gradient</h1>
<p>Using the two tools above, we can now get back to our original goal, which was
to compute the gradient of the expected sum of (discounted) rewards. Formally,
let <script type="math/tex">R(\tau)</script> be the reward function we want to optimize (i.e. maximize).
Using the above two tricks, we obtain:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim
\pi_\theta} \left[R(\tau) \cdot \nabla_\theta \left(\sum_{t=0}^{T-1}\log
\pi_\theta(a_t|s_t)\right)\right]</script>
<p>In the above, the expectation is with respect to the policy function, so think
of it as <script type="math/tex">\tau \sim \pi_\theta</script>. In practice, we need trajectories to get an
empirical expectation, which estimates this actual expectation.</p>
<p>So that’s the gradient! Unfortunately, we’re not quite done yet. The naive way
is to run the agent on a batch of episodes, get a set of trajectories (call it
<script type="math/tex">\hat{\tau}</script>) and update with <script type="math/tex">\theta \leftarrow \theta + \alpha
\nabla_\theta \mathbb{E}_{\tau \in \hat{\tau}}[R(\tau)]</script> using the empirical
expectation, but this will be too slow and unreliable due to high variance on
the gradient estimates. After one batch, we may exhibit a wide range of results:
much better performance, equal performance, or <em>worse</em> performance. The high
variance of these gradient estimates is precisely why there has been so much
effort devoted to variance reduction techniques. (I should also add from
personal research experience that variance reduction is certainly not limited to
reinforcement learning; it also appears in many statistical projects which
concern a bias-variance tradeoff.)</p>
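<p>Here is what the naive batch update looks like as code, on a two-armed bandit where each “trajectory” has length one. Everything concrete here (the bandit, the reward noise, the step size, the batch size) is invented purely for illustration:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Naive policy-gradient updates on a toy 2-armed bandit (trajectories of
# length one), just to make the update rule concrete. All constants here
# are made up for illustration.
rng = np.random.default_rng(0)
theta = np.zeros(2)                  # logits over two actions
alpha = 0.1                          # step size
mean_reward = np.array([0.0, 1.0])   # action 1 is better in expectation

for _ in range(500):
    pi = softmax(theta)
    grads, returns = [], []
    for _ in range(16):              # a small batch of "episodes"
        a = rng.choice(2, p=pi)
        r = mean_reward[a] + rng.normal(0, 0.5)   # noisy reward
        grads.append(np.eye(2)[a] - pi)           # grad_theta log pi(a)
        returns.append(r)
    # Empirical version of E[R(tau) * grad log pi(tau)], then gradient ascent.
    g_hat = np.mean([r * g for r, g in zip(grads_r, grads)], axis=0) if False else \
            np.mean([r * g for r, g in zip(returns, grads)], axis=0)
    theta = theta + alpha * g_hat

pi_final = softmax(theta)
```

Even on this trivial problem, the per-batch gradient estimates bounce around noticeably; that noise is the high variance the text is describing, and it is what the baseline below is meant to tame.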
<h1 id="how-to-introduce-a-baseline">How to Introduce a Baseline</h1>
<p>The standard way to reduce the variance of the above gradient estimates is to
insert a <strong>baseline function</strong> <script type="math/tex">b(s_t)</script> inside the expectation.</p>
<p>For concreteness, assume <script type="math/tex">R(\tau) = \sum_{t=0}^{T-1}r_t</script>, so we have no
discounted rewards. We can express the policy gradient in three equivalent, but
perhaps non-intuitive ways:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau)\Big] \;&{\overset{(i)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[\left(\sum_{t=0}^{T-1}r_t\right) \cdot \nabla_\theta \left(\sum_{t=0}^{T-1}\log \pi_\theta(a_t|s_t)\right)\right] \\
&{\overset{(ii)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'}\nabla_\theta \log \pi_\theta(a_t|s_t)\right] \\
&{\overset{(iii)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'}\right) \right]
\end{align} %]]></script>
<p>Comments:</p>
<ul>
<li>
<p><strong>Step (i)</strong> follows from plugging in our chosen <script type="math/tex">R(\tau)</script> into the policy
gradient we previously derived.</p>
</li>
<li>
<p><strong>Step (ii)</strong> follows from first noting that <script type="math/tex">\nabla_\theta
\mathbb{E}_{\tau}\Big[r_{t'}\Big] = \mathbb{E}_\tau\left[r_{t'} \cdot
\sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right]</script>. The reason
why this is true can be somewhat tricky to identify. I find it easy to think
of just re-defining <script type="math/tex">R(\tau)</script> as <script type="math/tex">r_{t'}</script> for some fixed time-step <script type="math/tex">t'</script>.
Then, we do the exact same computation above to get the final result, as shown
in the equation of the “Computing the Raw Gradient” section. The main
difference now is that since we’re considering the reward at time <script type="math/tex">t'</script>, our
trajectory under expectation <em>stops</em> at that time. More concretely,
<script type="math/tex">\nabla_\theta\mathbb{E}_{(s_0,a_0,\ldots,s_{T})}\Big[r_{t'}\Big] =
\nabla_\theta\mathbb{E}_{(s_0,a_0,\ldots,s_{t'})}\Big[r_{t'}\Big]</script>. This is
like “throwing away variables” when taking expectations due to “pushing
values” through sums and summing over densities (which cancel out); I have
another example later in this post which makes this explicit.</p>
<p>Next, we sum over both sides, for <script type="math/tex">t' = 0,1,\ldots,T-1</script>. Assuming we can
exchange the sum with the gradient, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t'=0}^{T-1} r_{t'}\right] \\
&= \sum_{t'=0}^{T-1}\nabla_\theta \mathbb{E}_{\tau^{(t')}} \Big[r_{t'}\Big] \\
&= \sum_{t'=0}^{T-1} \mathbb{E}_{\tau^{(t')}}\left[r_{t'} \cdot \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t'=0}^{T-1} r_{t'} \cdot \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right].
\end{align} %]]></script>
<p>where <script type="math/tex">\tau^{(t')}</script> indicates the trajectory up to time <script type="math/tex">t'</script>. (Full
disclaimer: I’m not sure if this formalism with <script type="math/tex">\tau</script> is needed, and I
think most people would do this computation without worrying about the precise
expectation details.)</p>
</li>
<li>
<p><strong>Step (iii)</strong> follows from a nifty algebra trick. To simplify the subsequent
notation, let <script type="math/tex">f_t := \nabla_\theta \log \pi_\theta(a_t|s_t)</script>. In addition,
<strong>ignore the expectation</strong>; we’ll only re-arrange the inside here. With this
substitution and setup, the sum inside the expectation from <strong>Step (ii)</strong>
turns out to be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
r_0f_0 &+ \\
r_1f_0 &+ r_1f_1 + \\
r_2f_0 &+ r_2f_1 + r_2f_2 + \\
\cdots \\
r_{T-1}f_0 &+ r_{T-1}f_1 + r_{T-1}f_2 \cdots + r_{T-1}f_{T-1}
\end{align} %]]></script>
<p>In other words, each <script type="math/tex">r_{t'}</script> has its own <em>row</em> of <script type="math/tex">f</script>-value to which it
gets distributed. Next, <em>switch to the column view</em>: instead of summing
row-wise, sum <em>column-wise</em>. The first column is <script type="math/tex">f_0 \cdot
\left(\sum_{t=0}^{T-1}r_t\right)</script>. The second is <script type="math/tex">f_1 \cdot
\left(\sum_{t=1}^{T-1}r_t\right)</script>. And so on. Doing this means we get the
desired formula after replacing <script type="math/tex">f_t</script> with its real meaning and hitting the
expression with an expectation.</p>
</li>
</ul>
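<p>The row-versus-column rearrangement in Step (iii) is a finite algebraic identity, so it can be checked numerically with arbitrary vectors standing in for the rewards and the <script type="math/tex">f_t</script> terms:</p>

```python
import numpy as np

# Numerical check of the Step (ii) -> Step (iii) rearrangement, with random
# scalars standing in for the f_t = grad log pi(a_t|s_t) terms.
rng = np.random.default_rng(0)
T = 7
r = rng.normal(size=T)
f = rng.normal(size=T)

# Row view (Step ii): each r_{t'} multiplies f_0 + ... + f_{t'}.
row_sum = sum(r[tp] * f[: tp + 1].sum() for tp in range(T))
# Column view (Step iii): each f_t multiplies the reward-to-go r_t + ... + r_{T-1}.
col_sum = sum(f[t] * r[t:].sum() for t in range(T))
```

The two sums agree up to floating-point error, for any choice of the vectors.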
<p>Note: it is <em>very easy</em> to make a typo with these. I checked my math carefully
and cross-referenced it with references online (which <em>themselves</em> have typos).
If any readers find a typo, please let me know.</p>
<p>Using the above formulation, we finally introduce our baseline <script type="math/tex">b</script>, which is a
function of <script type="math/tex">s_t</script> (and <em>not</em> <script type="math/tex">s_{t'}</script>, I believe). We “insert” it inside the
term in parentheses:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] =
\mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log
\pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'} - b(s_t)\right) \right]</script>
<p>At first glance, it doesn’t seem like this will be helpful, and one might wonder
if this would cause the gradient estimate to become biased. Fortunately, it
turns out that this is not a problem. This was surprising to me, because all we
know is that <script type="math/tex">b(s_t)</script> is a function of <script type="math/tex">s_t</script>. However, this is a bit
misleading because usually we want <script type="math/tex">b(s_t)</script> to be the <em>expected return</em>
starting at time <script type="math/tex">t</script>, which means it really “depends” on the subsequent time
steps. For now, though, just think of it as a function of <script type="math/tex">s_t</script>.</p>
<h1 id="understanding-the-baseline">Understanding the Baseline</h1>
<p>In this final section, I first go over why inserting <script type="math/tex">b</script> above doesn’t make
our gradient estimate biased. Next, I will go over why the baseline reduces
variance of the gradient estimate. These two capture the best of both worlds:
staying unbiased and reducing variance. In general, any time you have an
unbiased estimate and it remains so after applying a variance reduction
technique, then apply that variance reduction!</p>
<p>First, let’s show that the gradient estimate is unbiased. We see that with the
baseline, we can distribute and rearrange and get:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] =
\mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log
\pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'}\right) - \sum_{t=0}^{T-1}
\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t) \right]</script>
<p>Due to linearity of expectation, all we need to show is that for any single time
<script type="math/tex">t</script>, the gradient of <script type="math/tex">\log \pi_\theta(a_t|s_t)</script> multiplied with <script type="math/tex">b(s_t)</script>
is zero. This is true because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}_{\tau \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)\Big] &= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ \mathbb{E}_{s_{t+1:T},a_{t:T-1}} [\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)]\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot \underbrace{\mathbb{E}_{s_{t+1:T},a_{t:T-1}} [\nabla_\theta \log \pi_\theta(a_t|s_t)]}_{E}\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot \mathbb{E}_{a_t} [\nabla_\theta \log \pi_\theta(a_t|s_t)]\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot 0 \Big] = 0
\end{align} %]]></script>
<p>Here are my usual overly-detailed comments (apologies in advance):</p>
<ul>
<li>
<p><strong>Note I</strong>: this notation is similar to what I had before. The trajectory
<script type="math/tex">s_0,a_0,\ldots,a_{T-1},s_{T}</script> is now represented as <script type="math/tex">s_{0:T},a_{0:T-1}</script>.
In addition, the expectation is split up, which is allowed. If this is
confusing, think of the definition of the expectation with respect to at least
two variables. We can write brackets in any appropriately enclosed location.
Furthermore, we can “omit” the unnecessary variables in going from
<script type="math/tex">\mathbb{E}_{s_{t+1:T},a_{t:T-1}}</script> to <script type="math/tex">\mathbb{E}_{a_t}</script> (see expression
<script type="math/tex">E</script> above). Concretely, assuming we’re in discrete-land with actions in
<script type="math/tex">\mathcal{A}</script> and states in <script type="math/tex">\mathcal{S}</script>, this is because <script type="math/tex">E</script> evaluates
to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
E &= \sum_{a_t\in \mathcal{A}}\sum_{s_{t+1}\in \mathcal{S}}\cdots \sum_{s_T\in \mathcal{S}} \underbrace{\pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t) \cdots P(s_T|s_{T-1},a_{T-1})}_{p((a_t,s_{t+1},a_{t+1}, \ldots, a_{T-1},s_{T}))} (\nabla_\theta \log \pi_\theta(a_t|s_t)) \\
&= \sum_{a_t\in \mathcal{A}} \pi_\theta(a_t|s_t)\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{s_{t+1}\in \mathcal{S}} P(s_{t+1}|s_t,a_t) \sum_{a_{t+1}\in \mathcal{A}}\cdots \sum_{s_T\in \mathcal{S}} P(s_T|s_{T-1},a_{T-1})\\
&= \sum_{a_t\in \mathcal{A}} \pi_\theta(a_t|s_t)\nabla_\theta \log \pi_\theta(a_t|s_t)
\end{align} %]]></script>
<p>This is true because of the definition of expectation, whereby we get the
joint density over the entire trajectory, and then we can split it up like we
did earlier with the gradient of the log probability computation. We can
distribute <script type="math/tex">\nabla_\theta \log \pi_\theta(a_t|s_t)</script> all the way back to (but
not beyond) the first sum over <script type="math/tex">a_t</script>. Pushing sums “further back” results in
a bunch of sums over densities, each of which sums to one. The astute reader
will notice that this is precisely what happens with <a href="https://danieltakeshi.github.io/2015-07-12-notes-on-exact-inference-in-graphical-models/">variable elimination for
graphical models</a>. (The more technical reason why “pushing values back
through sums” is allowed has to do with abstract algebra properties of the sum
function, which is beyond the scope of this post.)</p>
</li>
<li>
<p><strong>Note II</strong>: This proof above also works with an infinite-time horizon. In
Appendix B of the <em>Generalized Advantage Estimation</em> paper (<a href="https://arxiv.org/abs/1506.02438">arXiv link</a>),
the authors do so with a proof exactly matching the above, except that <script type="math/tex">T</script>
and <script type="math/tex">T-1</script> are now infinity.</p>
</li>
<li>
<p><strong>Note III</strong>: About the expectation going to zero, that’s due to a
well-known fact about <em>score</em> functions, which are precisely the gradient of
log probabilities. We went over this in <a href="https://danieltakeshi.github.io/2016/12/20/review-of-theoretical-statistics-stat-210a-at-berkeley/">my STAT 210A class last fall</a>. It’s
<em>again</em> the log derivative trick. Observe that:</p>
<script type="math/tex; mode=display">\mathbb{E}_{a_t}\Big[\nabla_\theta \log \pi_\theta(a_t|s_t)\Big]
= \int \frac{\nabla_\theta
\pi_\theta(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\pi_{\theta}(a_t|s_t)da_t
= \nabla_\theta \int \pi_{\theta}(a_t|s_t)da_t = \nabla_\theta \cdot 1 = 0</script>
<p>where the last step follows from how <script type="math/tex">\pi_\theta</script> is a density. This follows
for all time steps, and since the gradient of the log gets distributed for
each <script type="math/tex">t</script>, it applies in all time steps. I switched to the continuous-land
version for this, but it also applies with sums, as I just recently used in
Note I.</p>
</li>
</ul>
<p>The above shows that introducing <script type="math/tex">b</script> doesn’t cause bias.</p>
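<p>The key zero-expectation step can also be verified exactly for a discrete softmax policy (my running toy example): summing <script type="math/tex">\pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)</script> over actions gives zero, mirroring the integral computation in Note III:</p>

```python
import numpy as np

# Exact check that E_{a ~ pi}[grad_theta log pi(a|s)] = 0 for a softmax
# policy over 4 actions with arbitrary logits. For softmax logits,
# grad log pi(a|s) = onehot(a) - pi, so the expectation is
# sum_a pi(a) * (onehot(a) - pi) = pi - pi = 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=4)          # arbitrary logits for one state
pi = np.exp(logits - logits.max())
pi = pi / pi.sum()

expected_score = sum(pi[a] * (np.eye(4)[a] - pi) for a in range(4))
```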
<p>The last thing to cover is why its introduction reduces variance. I provide an
approximate argument. To simplify notation, set <script type="math/tex">R_t(\tau) =
\sum_{t'=t}^{T-1}r_{t'}</script>. We focus on the <em>inside</em> of the expectation (of the
gradient estimate) to analyze the variance. The technical reason for this is
that expectations are <em>constant</em> (and thus have variance zero), but
in practice we have to approximate the expectations with trajectories, and that
has high variance.</p>
<p>The variance is approximated as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
{\rm Var}\left(\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) (R_t(\tau)-b(s_t))\right)\;&\overset{(i)}{\approx}\; \sum_{t=0}^{T-1} \mathbb{E}_\tau\left[\Big(\nabla_\theta \log \pi_\theta(a_t|s_t) (R_t(\tau)-b(s_t))\Big)^2\right] \\
\;&{\overset{(ii)}{\approx}}\; \sum_{t=0}^{T-1} \mathbb{E}_\tau \left[\Big(\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)^2\right]\mathbb{E}_\tau\left[\Big(R_t(\tau) - b(s_t)\Big)^2\right]
\end{align} %]]></script>
<p><strong>Approximation (i)</strong> is because we are approximating the variance of a sum by
computing the sum of the variances. This is not true in general, but if we can
assume this, then by the definition of the variance <script type="math/tex">{\rm Var}(X) :=
\mathbb{E}[X^2]-(\mathbb{E}[X])^2</script>, we are left with the <script type="math/tex">\mathbb{E}[X^2]</script>
term since we already showed that introducing the baseline doesn’t cause bias.
<strong>Approximation (ii)</strong> is because we assume independence among the values
involved in the expectation, and thus we can factor the expectation.</p>
<p>Finally, we are left with the term <script type="math/tex">\mathbb{E}_{\tau} \left[\Big(R_t(\tau) -
b(s_t)\Big)^2\right]</script>. If we are able to optimize our choice of <script type="math/tex">b(s_t)</script>, then
this is a least squares problem, and it is well known that the optimal choice of
<script type="math/tex">b(s_t)</script> is to be the expected value of <script type="math/tex">R_t(\tau)</script>. In fact, that’s <em>why</em>
policy gradient researchers usually want <script type="math/tex">b(s_t) \approx
\mathbb{E}[R_t(\tau)]</script> to approximate the expected return starting at time
<script type="math/tex">t</script>. At last, I understand.</p>
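<p>A quick empirical check of this variance claim, on a one-step toy problem of my own construction: subtracting <script type="math/tex">b \approx \mathbb{E}[R]</script> leaves the mean of the gradient estimate essentially unchanged while shrinking its variance dramatically:</p>

```python
import numpy as np

# Toy one-step check that subtracting a baseline b ~ E[R] shrinks the
# variance of the per-sample gradient estimate without changing its mean.
# The policy, reward scale, and sample size are all invented for this demo.
rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mean_reward = np.array([10.0, 11.0])     # large reward offset: a baseline helps a lot

a = rng.choice(2, p=pi, size=100_000)
r = mean_reward[a] + rng.normal(0, 1.0, size=a.size)
score = np.eye(2)[a] - pi                # grad log pi(a) for each sample (softmax form)
b = (pi * mean_reward).sum()             # baseline = expected reward

g_no_baseline = score * r[:, None]
g_baseline = score * (r - b)[:, None]

# Means agree (unbiasedness); variance drops by roughly two orders of magnitude.
mean_gap = np.abs(g_no_baseline.mean(axis=0) - g_baseline.mean(axis=0)).max()
var_drop = g_baseline.var(axis=0).sum() / g_no_baseline.var(axis=0).sum()
```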
<p>How accurate are these approximations in practice? My intuition is that they are
actually fine, because recent advances in reinforcement learning algorithms,
such as A3C, focus on the problem of breaking correlation among samples. If the
correlation among samples is broken, then Approximation (i) becomes better,
because I think the samples <script type="math/tex">s_0,a_0,\ldots,a_{T-1},s_{T}</script> are <em>no longer
generated from the same trajectory</em>.</p>
<p>Well, that’s my intuition. If anyone else has a better way of describing it,
feel free to let me know in the comments or by email.</p>
<h1 id="concluding-remarks">Concluding Remarks</h1>
<p>Hopefully, this is a helpful, self-contained, bare-minimum introduction to
policy gradients. I am trying to learn more about these algorithms, and going
through the math details is helpful. This will also make it easier for me to
understand the increasing number of research papers that are using this
notation.</p>
<p>I also have to mention: I remember a few years ago during the <a href="https://danieltakeshi.github.io/2015-12-17-review-of-deep-reinforcement-learning-cs-294-112-at-berkeley/">first iteration
of CS 294-112</a> that I had no idea how policy gradients worked. Now, I think I
have become slightly more enlightened.</p>
<p><strong>Acknowledgements</strong>: I thank John Schulman for making his notes publicly
available.</p>
Tue, 28 Mar 2017 06:00:00 +0000
https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/
<h1>Keeping Track of Research Articles: My Paper Notes Repository</h1>
<p>The number of research papers in Artificial Intelligence has reached
unmanageable proportions. Conferences such as ICML, NIPS, and ICLR are
getting record amounts of paper submissions. In addition, tens of AI-related
papers get uploaded to arXiv <em>every weekday</em>. With all these papers, it can be
easy to feel lost and overwhelmed.</p>
<p>Like many researchers, I think I do not read enough research papers. This year,
I resolved to change that, so I started an <a href="https://github.com/DanielTakeshi/Paper_Notes">open-source GitHub repository called
“Paper Notes”</a> where I list papers that I’ve read along with my personal
notes and summaries, if any. Papers without such notes are currently on my TODO
radar.</p>
<p>After almost three months, I’m somewhat pleased with my reading progress. There
are a healthy number of papers (plus notes) listed, arranged by subject matter
and then further arranged by year. Not enough for me, but certainly not terrible
either.</p>
<p>I was inspired to make this by seeing <a href="https://github.com/dennybritz/deeplearning-papernotes">Denny Britz’s similar repository</a>,
along with <a href="https://blog.acolyer.org/about/">Adrian Colyer’s blog</a>. My repository is similar to Britz’s,
though my aim is not to list all papers in Deep Learning, but to write down the
ones that I actually plan to read at some point. (I see other repositories where
people simply list Deep Learning papers without notes, which seems pretty
pointless to me.) Colyer’s blog posts represent the kind of notes that I’d like
to take for each paper, but I know that I can’t dedicate <em>that</em> much time to
fine-tuning notes.</p>
<p>Why did I choose GitHub as the backend for my paper management, rather than
something like <a href="https://www.mendeley.com/">Mendeley</a>? First, GitHub is the default place where (pretty
much) everyone in AI puts their open-source stuff: blogs, code, you name it. I’m
already used to GitHub, so Mendeley would have to provide some serious benefit
for me to switch over. I also don’t need to use advanced annotation and
organizing materials, given that the top papers are easily searchable online
(including their BibTeX references). In addition, by making my Paper Notes
repository online, I can show this as evidence to others that I’m reading
papers. Maybe this will even impress a few folks, and I say this only because
everyone wants to be noticed in some way; that’s partly Colyer’s inspiration for
his blog. So I think, on balance, it will be useful for me to keep updating
this repository.</p>
Thu, 23 Mar 2017 22:30:00 +0000
https://danieltakeshi.github.io/2017/03/23/keeping-track-of-research-articles-my-paper-notes-repository/
<h1>What Biracial People Know</h1>
<p><a href="https://www.nytimes.com/2017/03/04/opinion/sunday/what-biracial-people-know.html">There’s an opinion piece in the New York Times</a> by Moises Velasquez-Manoff
which talks about (drum roll please) biracial people. As he mentions:</p>
<blockquote>
<p>Multiracials make up an estimated 7 percent of Americans, according to the Pew
Research Center, and they’re predicted to grow to 20 percent by 2050.</p>
</blockquote>
<p>Thus, I suspect that sometime in the next few decades, we will start talking
about race in terms of precise racial percentages, such as “100 percent White”
or in rarer cases, “25 percent White, 25 percent Asian, 25 percent Black, and 25
percent Native American.” (Incidentally, I’m not sure why the article uses
“Biracial” when “Multiracial” would clearly have been a more appropriate term;
it was likely due to the Barack Obama factor.)</p>
<p>The phrase “precise racial percentages” is misleading. Since all humans came
from the same ancestor, at some point in history we must have been “one race.”
For the sake of defining these racial percentages, we can take a date — say
4000BC — when, presumably, the various races were sufficiently different,
ensconced in their respective geographic regions, and when interracial marriages
(or rape) were at a minimum. All humans alive at that point thus get a “100
percent [insert_race_here]” attached to them, and we do the arithmetic from
there.</p>
<p>What usually happens in practice, though, is that we often default to describing
one part of one race, particularly with people who are <script type="math/tex">X</script> percent Black,
where <script type="math/tex">X > 0</script>. This is a relic of the embarrassing “One Drop Rule” the United
States had, but for now it’s probably — well, I hope — more for
self-selecting racial identity.</p>
<p>Listing precise racial percentages would help us better identify people who are
not easy to immediately peg in racial categories, which will increasingly become
an issue as more and more multiracial people like me blur the lines between the
races. In fact, this is already a problem for me even with single-race people:
I sometimes cannot distinguish between Hispanics versus Whites. For instance, I
thought Ted Cruz and Marco Rubio were 100 percent White.</p>
<p>Understanding race is also important when considering racial diversity and
various ethical or sensitive questions over who should get “preferences.” For
instance, I wonder if people label me as a “privileged white male” or if I get a
pass for being biracial? Another question: for a job at a firm which has had a
history of racial discrimination and is trying to make up for that, should the
applicant who is 75 percent Black, 25 percent White, get a hair’s preference
versus someone who is 25 percent Black and 75 percent White? Would this also
apply if they actually have very similar skin color?</p>
<p>In other words, does one weigh more towards the looks or the precise
percentages? I think the precise percentages method is the way schools,
businesses, and government operate, despite how this isn’t the case in casual
conversations.</p>
<p>Anyway, these are some of the thoughts that I have as we move towards a more
racially diverse society, as multiracial people cannot have single-race children
outside of adoption.</p>
<p>Back to the article: as one would expect, it discusses the benefits of racial
diversity. I can agree with the following passage:</p>
<blockquote>
<p>Social scientists find that homogeneous groups like [Donald Trump’s] cabinet
can be less creative and insightful than diverse ones. They are more prone to
groupthink and less likely to question faulty assumptions.</p>
</blockquote>
<p>The caveat is that this assumes the people involved are equally qualified; a
racially homogeneous (in whatever race), but extremely well-educated cabinet
would be much better than a racially diverse cabinet where no one even finished
high school. But controlling for quality, I can agree.</p>
<p>Diversity also benefits individuals, as the author notes. It is here where Mr.
Velasquez-Manoff points out that Barack Obama was not just Black, but also
biracial, which may have benefited his personal development. Multiracials make
up a large fraction of the population in racially diverse Hawaii, where Obama
was born (albeit, probably with more Asian-White overlap).</p>
<p>Yes, I agree that diversity is important for a variety of reasons. It is not
easy, however:</p>
<blockquote>
<p>It’s hard to know what to do about this except to acknowledge that diversity
isn’t easy. It’s uncomfortable. It can make people feel threatened. “We
promote diversity. We believe in diversity. But diversity is hard,” Sophie
Trawalter, a psychologist at the University of Virginia, told me.</p>
</blockquote>
<blockquote>
<p>That very difficulty, though, may be why diversity is so good for us. “The pain
associated with diversity can be thought of as the pain of exercise,” Katherine
Phillips, a senior vice dean at Columbia Business School, writes. “You have to
push yourself to grow your muscles.”</p>
</blockquote>
<p>I cannot agree more.</p>
<p>Moving on:</p>
<blockquote>
<p>Closer, more meaningful contact with those of other races may help assuage the
underlying anxiety. Some years back, Dr. Gaither of Duke ran an intriguing
study in which incoming white college students were paired with either
same-race or different-race roommates. After four months, roommates who lived
with different races had a more diverse group of friends and considered
diversity more important, compared with those with same-race roommates. After
six months, they were less anxious and more pleasant in interracial
interactions.</p>
</blockquote>
<p>Ouch, this felt like a blindsiding attack, and is definitely my main gripe with
this article. In college, I had two roommates, both of whom have a different
racial makeup than me. They both seemed to be relatively popular and had little
difficulty mingling with a diverse group of students. Unfortunately, I certainly
did not have a “diverse group of friends.” After all, if there were a college
prize for “least popular student,” I would be a perennial contender. (As
incredible as it may sound, in high school, where things were <em>worse</em> for me, I
can remember a handful of people who might have been even <em>lower</em> on the social
hierarchy.)</p>
<p>Well, I guess what I want to say is that, this attack notwithstanding, Mr.
Velasquez-Manoff’s article brings up interesting and reasonably accurate points
about biracial people. At the very least, he writes about concepts which are
sometimes glossed over or under-appreciated nowadays in our discussions about
race.</p>
Sat, 11 Mar 2017 12:30:00 +0000
https://danieltakeshi.github.io/2017/03/11/what-biracial-people-know/
https://danieltakeshi.github.io/2017/03/11/what-biracial-people-know/Understanding Generative Adversarial Networks<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/goodfellow_gradients.png" alt="gradients" /> </p>
<p>Over the last few weeks, I’ve been learning more about some mysterious thing
called <em>Generative Adversarial Networks</em> (GANs). GANs originally came out of a
2014 NIPS paper (<a href="https://arxiv.org/abs/1406.2661">read it here</a>) and have had a remarkable impact on machine
learning. I’m surprised that, until I was the TA for Berkeley’s Deep Learning
class last semester, I had never heard of GANs before.<sup id="fnref:goodfellow"><a href="#fn:goodfellow" class="footnote">1</a></sup></p>
<p>They certainly haven’t gone unnoticed in the machine learning community, though.
Yann LeCun, one of the leaders in the Deep Learning community, had this to say
about them <a href="https://www.quora.com/session/Yann-LeCun/1">during his Quora session</a> on July 28, 2016:</p>
<blockquote>
<p>The most important one, in my opinion, is adversarial training (also called
GAN for Generative Adversarial Networks). This is an idea that was originally
proposed by Ian Goodfellow when he was a student with Yoshua Bengio at the
University of Montreal (he since moved to Google Brain and recently to
OpenAI).</p>
</blockquote>
<blockquote>
<p>This, and the variations that are now being proposed is the most interesting
idea in the last 10 years in ML, in my opinion.</p>
</blockquote>
<p>If he says something like that about GANs, then I have no excuse for not
learning about them. Thus, I read what is probably the highest-quality general
overview available nowadays: <a href="https://arxiv.org/abs/1701.00160">Ian Goodfellow’s tutorial on arXiv</a>, which he
then presented in some form at NIPS 2016. This was really helpful for me, and I
hope that later, I can write something like this (but on another topic in AI).</p>
<p>I won’t repeat what GANs can do here. Rather, I’m more interested in knowing how
GANs are <em>trained</em>. Following now are some of the most important insights I
gained from reading the tutorial:</p>
<ul>
<li>
<p><strong>Major Insight 1</strong>: the discriminator’s loss function is the cross entropy
loss function. To understand this, let’s suppose we’re doing some binary
classification with some trainable function <script type="math/tex">D: \mathbb{R}^n \to [0,1]</script> that
we wish to optimize, where <script type="math/tex">D</script> indicates the estimated probability of some
data point <script type="math/tex">x_i \in \mathbb{R}^n</script> being in the first class. To get the
predicted probability of being in the second class, we just do <script type="math/tex">1-D(x_i)</script>.
The output of <script type="math/tex">D</script> must therefore be constrained in <script type="math/tex">[0,1]</script>, which is easy
to do if we tack on a sigmoid layer at the end. Furthermore, let <script type="math/tex">(x_i,y_i)
\in \mathbb{R}^n \times \{0,1\}</script> be the input-label pairing for training data
points.</p>
<p>The <strong>cross entropy</strong> between two distributions, which we’ll call <script type="math/tex">p</script> and
<script type="math/tex">q</script>, is defined as</p>
<script type="math/tex; mode=display">H(p,q) := -\sum_i p_i \log q_i</script>
<p>where <script type="math/tex">p</script> and <script type="math/tex">q</script> denote a “true” and an “empirical/estimated”
distribution, respectively. Both are <em>discrete</em> distributions, hence we can
sum over their individual components, denoted with <script type="math/tex">i</script>. (We would need to
have an integral instead of a sum if they were continuous.)</p>
<p>To apply this loss function to the current binary classification task, we
define the true distribution as <script type="math/tex">\mathbb{P}[y_i = 0] = 1</script> if <script type="math/tex">y_i=0</script>, or
<script type="math/tex">\mathbb{P}[y_i = 1] = 1</script> if <script type="math/tex">y_i=1</script>. Putting this in 2-D vector form, it’s
either <script type="math/tex">[1,0]</script> or <script type="math/tex">[0,1]</script>. Intuitively, we know for sure which class this
belongs to, so it makes sense for a probability distribution to be a “one-hot”
vector.</p>
<p>Thus, for one data point <script type="math/tex">x_1</script> and its label, we get the following loss
function, where here I’ve changed the input to be more precise:</p>
<script type="math/tex; mode=display">H((x_1,y_1),D) = - y_1 \log D(x_1) - (1-y_1) \log (1-D(x_1))</script>
<p>Let’s look at the above function. Notice that only one of the two terms is
going to be zero, depending on the value of <script type="math/tex">y_1</script>, which makes sense since
it’s defining a distribution which is either <script type="math/tex">[0,1]</script> or <script type="math/tex">[1,0]</script>. The other
part is the estimated distribution from <script type="math/tex">D</script>. In both cases (the true and
predicted distributions) we are encoding a 2-D distribution with one value,
which lets us treat <script type="math/tex">D</script> as a real-valued function.</p>
<p>That was for one data point. Summing over the entire dataset of <script type="math/tex">N</script>
elements, we get something that looks like this:</p>
<script type="math/tex; mode=display">H((x_i,y_i)_{i=1}^N,D) = - \sum_{i=1}^N y_i \log D(x_i) - \sum_{i=1}^N (1-y_i) \log (1-D(x_i))</script>
<p>In the case of GANs, we can say a little more about what these terms mean. In
particular, our <script type="math/tex">x_i</script>s only come from two sources: either <script type="math/tex">x_i \sim p_{\rm
data}</script>, the true data distribution, or <script type="math/tex">x_i = G(z)</script> where <script type="math/tex">z \sim p_{\rm
generator}</script>, the generator’s distribution, based on some input code <script type="math/tex">z</script>. It
might be <script type="math/tex">z \sim {\rm Unif}[0,1]</script> but we will leave it unspecified.</p>
<p>In addition, we want exactly half of the data to come from each of these
two sources.</p>
<p>To apply this to the sum above, we need to encode this probabilistically, so
we replace the sums with expectations, the <script type="math/tex">y_i</script> labels with <script type="math/tex">1/2</script>, and we
can furthermore replace the <script type="math/tex">\log (1-D(x_i))</script> term with <script type="math/tex">\log (1-D(G(z)))</script>
under some sampled code <script type="math/tex">z</script> for the generator. We get</p>
<script type="math/tex; mode=display">H((x_i,y_i)_{i=1}^\infty,D) = - \frac{1}{2} \mathbb{E}_{x \sim p_{\rm
data}}\Big[ \log D(x)\Big] - \frac{1}{2} \mathbb{E}_{z} \Big[\log
(1-D(G(z)))\Big]</script>
<p>This is precisely the loss function for the discriminator, <script type="math/tex">J^{(D)}</script>.</p>
</li>
<li>
<p><strong>Major Insight 2</strong>: understanding how gradient saturation may or may not
adversely affect training. Gradient saturation is a general problem when
gradients are too small (i.e., nearly zero) to perform any learning. See <a href="http://cs231n.github.io/neural-networks-1/">Stanford’s
CS 231n notes on gradient saturation here</a> for more details. In the context
of GANs, gradient saturation may happen due to poor design of the generator’s
loss function, so this “major insight” of mine is also based on understanding
the tradeoffs among different loss functions for the generator. This design,
incidentally, is where we can be creative; the discriminator needs the cross
entropy loss function above since it has a very specific role (to
discriminate between two classes) and the cross entropy is the “best” way of
doing this.</p>
<p>Using Goodfellow’s notation, we have the following candidates for the
generator loss function, as discussed in the tutorial. The first is the
<strong>minimax</strong> version:</p>
<script type="math/tex; mode=display">J^{(G)} = -J^{(D)} = \frac{1}{2} \mathbb{E}_{x \sim p_{\rm
data}}\Big[ \log D(x)\Big] + \frac{1}{2} \mathbb{E}_{z} \Big[\log
(1-D(G(z)))\Big]</script>
<p>The second is the <strong>heuristic, non-saturating</strong> version:</p>
<script type="math/tex; mode=display">J^{(G)} = -\frac{1}{2}\mathbb{E}_z\Big[\log D(G(z))\Big]</script>
<p>Finally, the third is the <strong>maximum likelihood</strong> version:</p>
<script type="math/tex; mode=display">J^{(G)} = -\frac{1}{2}\mathbb{E}_z\left[e^{\sigma^{-1}(D(G(z)))}\right]</script>
<p>What are the advantages and disadvantages of these generator loss functions?
For the minimax version, it’s simple and allows for easier theoretical
results, but in practice it’s not that useful, due to gradient saturation. As
Goodfellow notes:</p>
<blockquote>
<p>In the minimax game, the discriminator minimizes a cross-entropy, but the
generator maximizes the same cross-entropy. This is unfortunate for the
generator, because when the discriminator successfully rejects generator
samples with high confidence, the generator’s gradient vanishes.</p>
</blockquote>
<p>As suggested in <a href="http://neuralnetworksanddeeplearning.com/chap3.html">Chapter 3 of Michael Nielsen’s excellent online
book</a>, the cross-entropy is a great loss function since it is designed in
part to accelerate learning and to avoid gradient saturation except when the
classifier is already correct (since we don’t want the gradient to keep moving
in that case!).</p>
<p>I’m not sure how to clearly describe this formally. For now, I will defer to
Figure 16 in Goodfellow’s tutorial (see the top of this blog post), which
nicely shows the value of <script type="math/tex">J^{(G)}</script> as a function of the discriminator’s
output, <script type="math/tex">D(G(z))</script>. Indeed, when the discriminator is winning, we’re at the
left side of the graph, since the discriminator outputs the probability of the
sample being from the <em>true</em> data distribution.</p>
<p>By the way, why is <script type="math/tex">J^{(G)} = -J^{(D)}</script> only a function of <script type="math/tex">D(G(z))</script> as
suggested by the figure? What about the other term in <script type="math/tex">J^{(D)}</script>? Notice that
of the two terms in the loss function, the first one is <em>only</em> a function of
the discriminator’s parameters! The second part, which uses the <script type="math/tex">D(G(z))</script>
term, depends on both <script type="math/tex">D</script> and <script type="math/tex">G</script>. Hence, for the purposes of performing
gradient descent with respect to the parameters of <script type="math/tex">G</script>, only the second term
in <script type="math/tex">J^{(D)}</script> matters; the first term is a constant that disappears after
taking derivatives <script type="math/tex">\nabla_{\theta^{(G)}}</script>.</p>
<p>The figure makes it clear that the generator will have a hard time doing any
sort of gradient update at the left portion of the graph, since the
derivatives are close to zero. The problem is that the left portion of the
graph represents the most common case when starting the game. The generator,
after all, starts out with basically random parameters, so the discriminator
can easily tell what is real and what is fake.<sup id="fnref:discriminator"><a href="#fn:discriminator" class="footnote">2</a></sup></p>
<p>Let’s move on to the other two generator cost functions. The second one, the
heuristically-motivated one, uses the idea that the generator’s gradient only
depends on the second term in <script type="math/tex">J^{(D)}</script>. Rather than flipping the sign of
<script type="math/tex">J^{(D)}</script>, it flips the target: changing <script type="math/tex">\log (1-D(G(z)))</script> to
<script type="math/tex">\log D(G(z))</script>. In other words, the “sign flipping” happens at a different
part, so the generator still optimizes something “opposite” of the
discriminator. From this re-formulation, it appears from the figure above
that <script type="math/tex">J^{(G)}</script> now has desirable gradients in the left portion of the graph.
Thus, the advantage here is that the generator gets a strong gradient signal
so that it can quickly improve. The downside is that it’s not easier to
analyze, but who cares?</p>
<p>Finally, the maximum likelihood cost function has the advantage of being
motivated based on maximum likelihood, which by itself has a lot of desirable
properties. Unfortunately, the figure above shows that it has a flat slope in
the left portion, though it seems to be slightly better than the minimax
version since it decreases rapidly “sooner.” That might not be an
“advantage,” though, since Goodfellow warns about its high variance. That might be worth
thinking about in more detail.</p>
<p>One last note: the function <script type="math/tex">J^{(G)}</script>, at least for the three cost functions
here, does <em>not</em> depend directly on <script type="math/tex">x</script> at all! That’s interesting … and
in fact, Goodfellow argues that this makes GANs resistant to overfitting, since
the generator can’t copy from <script type="math/tex">x</script>.</p>
</li>
</ul>
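<p>To make the loss functions above concrete, here is a small NumPy sketch of my own (the function names and sample values are illustrative, not code from the tutorial). Writing <script type="math/tex">D(G(z)) = \sigma(a)</script> for the discriminator’s pre-sigmoid logit <script type="math/tex">a</script>, it evaluates the discriminator’s cross-entropy loss and compares the slopes of the three candidate generator losses in the regime where the discriminator confidently rejects generated samples:</p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Discriminator loss: average cross entropy with half the minibatch real and
# half fake. d_real holds D(x) for x ~ p_data, d_fake holds D(G(z)).
def discriminator_loss(d_real, d_fake):
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))

# The three candidate generator losses, as functions of the logit a,
# where D(G(z)) = sigmoid(a).
def g_minimax(a):    # (1/2) log(1 - D(G(z))), the G-dependent part of -J^(D)
    return 0.5 * np.log(1.0 - sigmoid(a))

def g_heuristic(a):  # -(1/2) log D(G(z)): flip the target, not the sign
    return -0.5 * np.log(sigmoid(a))

def g_maxlike(a):    # -(1/2) exp(sigma^{-1}(D(G(z)))) = -(1/2) e^a
    return -0.5 * np.exp(a)

def grad(f, a, eps=1e-5):
    """Centered finite-difference derivative of f at a."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

# A discriminator that is doing well on a tiny batch of 2 real + 2 fake points.
print(discriminator_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2])))  # ~0.164

# Early in training the discriminator wins easily, so D(G(z)) is near 0,
# i.e. the logit a is very negative -- the left side of Figure 16.
a = -8.0
print(sigmoid(a))            # D(G(z)) is about 3.4e-4
print(grad(g_minimax, a))    # about -1.7e-4: the gradient vanishes (saturation)
print(grad(g_maxlike, a))    # about -1.7e-4: also nearly flat on this side
print(grad(g_heuristic, a))  # about -0.5: a healthy gradient signal
```

<p>The heuristic loss keeps a roughly constant slope as the logit becomes very negative, while the other two slopes shrink toward zero, matching the saturation behavior discussed above.</p>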
<p>I wish more tutorials like this existed for other AI concepts. I particularly
enjoyed the three exercises <em>and</em> the solutions within this tutorial on GANs. I
have <a href="https://github.com/DanielTakeshi/Paper_Notes/blob/master/deep_learning/NIPS_2016_Tutorial_Generative_Adversarial_Networks.md">more detailed notes here</a> in my <em>Paper Notes</em> GitHub repository (I
should have started this repository back in 2013). I highly recommend this
tutorial to anyone wanting to know more about GANs.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:goodfellow">
<p>Ian Goodfellow, the lead author on the GANs paper, gave a guest
lecture for the class, where (obviously) he talked about GANs. <a href="#fnref:goodfellow" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:discriminator">
<p>Actually, the discriminator also starts out random, right? I
think the discriminator has an easier job, though, since supervised learning
is easier than generating realistic images (I mean, c’mon??) so perhaps the
discriminator simply learns faster, and the generator has to spend a lot of
time catching up. <a href="#fnref:discriminator" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 05 Mar 2017 09:30:00 +0000
https://danieltakeshi.github.io/2017/03/05/understanding-generative-adversarial-networks/
https://danieltakeshi.github.io/2017/03/05/understanding-generative-adversarial-networks/My Thoughts on CS 231n Being Forced To Take Down Videos<p><a href="http://cs231n.stanford.edu/">CS 231n: Convolutional Neural Networks for Visual Recognition</a> is, in my
biased opinion, one of the most important <em>and</em> thrilling courses offered by
Stanford University. It has been taught twice so far and will appear again
in the upcoming Spring quarter.</p>
<p>Due to its popularity, the course lectures for the second edition (Winter 2016)
were videotaped and released online. This is not unusual among computer science
graduate level courses due to high demand both inside and outside the
university.</p>
<p>Unfortunately, as <a href="http://cs231n.stanford.edu/">discussed in this rather large reddit discussion thread</a>,
Andrej Karpathy (one of the three instructors) was forced to pull down the
lecture videos. He later clarified on his Twitter account that the reason
had to do with the lack of captioning/subtitles in the lecture videos, which
relates to <a href="https://danieltakeshi.github.io/2015/02/14/harvard-and-mits-lack-of-closed-captions/">a news topic I blogged about just over two years ago</a>.</p>
<p>If you browse the reddit thread, you will see quite a lot of unhappy students.
I just joined reddit and I was hoping to make a comment there, but reddit
disables posting after six months. And after thinking about it, I thought it
would make more sense to write some brief thoughts here instead.</p>
<p>To start, I should state upfront that I have no idea what happened beyond the
stuff we can all read online. I don’t know who made the complaint, what the
course staff did, etc.</p>
<p>Here’s my stance regarding class policies on watching videos:</p>
<blockquote>
<p>If a class <em>requires</em> watching videos for whatever reason, then that video
should have subtitles. Otherwise, no such action is necessary, though the
course staff should attempt as much as is reasonable to have subtitles for all
videos.</p>
</blockquote>
<p>I remember two times when I had to face this problem of watching a non-subtitled
video as a homework assignment: in an introductory Women’s, Gender, and
Sexuality Studies course and an Africana Studies class about black athletes. For
the former, we were assigned to watch a video about a transgender couple, and
for the latter, the video was about black golfers. In both cases, the
professors gave me copies of the movie (other students didn’t get these) and I
watched one in a room myself with the volume cranked up and the other one with
another person who told me what was happening.</p>
<p>Is that ideal? Well, no. To (new) readers of this blog, welcome to the story of
my life!</p>
<p>More seriously, was I supposed to do something about it? The professors didn’t
make the videos, which were a tiny portion of the overall courses. I didn’t want
to get all up in arms about this, so in both cases, I brought it up with them
and they understood my situation (and apologized).</p>
<p>Admittedly, my brief stance above is incomplete and belies a vast gray area.
What if students are given the option of doing one of two “required”
assignments: watching a video or reading a book? That’s a gray area, though I
would personally lean towards counting that as “required viewing” and thus
“required subtitles.”</p>
<p>Class lecture videos also fall in a gray area. They are not <em>required viewing</em>,
because students should attend lectures in person. Unfortunately, the lack of
subtitles for these videos definitely puts deaf and hard of hearing students
like myself at a disadvantage. I’ve lost count of the number of lectures that I
wish I could have re-watched, but it is extraordinarily difficult for me to do
so with non-subtitled videos.</p>
<p>Ultimately, however, as long as I can attend lectures and understand some of the
material, I do not worry about whether lecture videos have subtitles. Just
about every videotaped class that I have taken did not have subtitled lecture
videos, with one exception: CS 267 from Spring 2016, after <a href="https://danieltakeshi.github.io/2016-02-05-why-i-reluctantly-dont-show-up-to-class/">I had negotiated
about it with Berkeley’s DSP</a>.</p>
<p>Heck, the <a href="https://danieltakeshi.github.io/2016/12/19/reflections-on-being-a-gsi-for-deep-neural-networks-cs-294-129-at-berkeley/">CS 294-129 class <em>which I TA-ed for last semester</em></a> — which is
based on CS 231n! — had lecture videos. Were there captions? <em>Nope</em>.</p>
<p>Am I frustrated? Yes, but it’s <em>understandable</em> frustration due to the cost of
adding subtitles. As a similar example, I’m frustrated at the identity politics
practiced by the Democratic party, but it’s <em>understandable</em> frustration due to
what political science instructs us to do, which is why I’m not planning to jump
ship to another party.</p>
<p>Thus in my case, if I were a student in CS 231n, I would not be inclined to
pressure the staff to pull the videos down. Again, this comes with the obvious
caveat; I don’t know the situation and it might have been worse than I imagine.</p>
<p>As this discussion would imply, I don’t like pulling down lecture videos as
“collateral damage”.<sup id="fnref:lewin"><a href="#fn:lewin" class="footnote">1</a></sup> I worry, however, if that’s in part because I’m too
timid. Hypothetically and broadly speaking, if I have to take out my
frustration (e.g. with lawsuits) on certain things, I don’t want to do this for
something like lecture videos, which would make a number of folks angry at me,
whether or not they openly express it.</p>
<p>On a more positive note … it turns out that, actually, <em>the CS 231n lecture
videos are online</em>! I’m not sure why, but I’m happy. Using YouTube’s automatic
captions, I watched one of the lectures and finally understood a concept that
was critical and essential for me to know when I was writing <a href="https://danieltakeshi.github.io/2017/01/21/understanding-higher-order-local-gradient-computation-for-backpropagation-in-deep-neural-networks/">my latest
technical blog post</a>.</p>
<p>Moreover, the automatic captions are getting better and better each year. They
work pretty well on Andrej, who has a slight accent (Russian?). I dislike
attending research talks if I don’t understand what’s going on, but given that
so many are videotaped these days, whether at Berkeley or at conferences, maybe
watching them offline is finally becoming a viable alternative.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:lewin">
<p>In another case where lecture videos had to be removed, consider MIT’s
Open Courseware and Professor Walter Lewin’s famous physics lectures. MIT
removed the videos after it was found that Lewin had sexually harassed some
of his students. Lewin’s harassment disgusted me, but I respectfully
disagreed with MIT’s position about removing his videos, siding with
then-MIT professor Scott Aaronson. In an <a href="http://www.scottaaronson.com/blog/?p=2091">infamous blog post</a>, Professor
Aaronson explained why he opposed the removal of the videos, which
subsequently caused him to be the subject of a hate-rage/attack.
Consequently, I am now a permanent reader of his blog. <a href="#fnref:lewin" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 25 Feb 2017 09:00:00 +0000
https://danieltakeshi.github.io/2017/02/25/my-thoughts-on-cs231n-being-forced-to-take-down-videos/
https://danieltakeshi.github.io/2017/02/25/my-thoughts-on-cs231n-being-forced-to-take-down-videos/These Aren't Your Father's Hearing Aids<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/dynamo-product-tile-02.jpg" alt="oticon_dynamo" /> </p>
<p>I am now wearing <a href="https://www.oticon.com/solutions/dynamo/">Oticon Dynamo hearing aids</a>. The good news is that I’ve
<a href="https://danieltakeshi.github.io/2016-04-23-a-nice-running-route-through-the-berkeley-marina-and-cesar-chavez-park/">run many times with them</a> and so far have not had issues with water
resistance.</p>
<p>However, I wanted to bring up a striking point that really made me realize about
how our world has changed remarkably in the last few years.</p>
<p>A few months ago, when I was first fitted with the hearing aids, my audiologist
set the default volume level to be “on target” for me. The hearing aid is
designed to provide different amounts of power to people depending on their raw
hearing level. There’s a volume control on it which goes from “1” (weak) to “4”
(powerful), which I can easily adjust as I wish. The baseline setting is “3”,
but this baseline is what audiologists adjust on a case-by-case basis. This means
my “3” (and thus, my “1” and “4” settings) may be more powerful, less powerful,
or the same compared to the respective settings for someone else.</p>
<p>When my audiologist first fit the hearing aids for me, I felt that my left
hearing aid was too quiet and my right one too loud by default, so she modified
the baselines.</p>
<p>She also, critically, gave me about a week to adjust to the hearing aids, and I
was to report back on whether its strength was correctly set.</p>
<p>During that week, I wore the hearing aids, but I then decided that I was
originally mistaken about both hearing aids, since I had to repeatedly
<em>increase</em> the volume for the left one and <em>decrease</em> the volume for the right
one.</p>
<p>I reported back to my audiologist and said that she was right all along, and
that my baselines needed to be back to their default levels. She was able to
corroborate my intuition by showing me — amazingly — <em>how often I had
adjusted the hearing aid volume level</em>, and <em>in which direction</em>.</p>
<p>Hearing aids are, apparently, now fitted with these advanced sensors so they can
track exactly how you adjust them (volume controls or otherwise).</p>
<p>The lesson is that just about everything nowadays consists of <em>sensors</em>, a point
which is highlighted in Thomas L. Friedman’s excellent book <em><a href="http://www.thomaslfriedman.com/thank-you-for-being-late/">Thank You for
Being Late</a></em>. It is also a characteristic of what computer scientists refer
to as the “<a href="https://en.wikipedia.org/wiki/Internet_of_things">Internet of Things</a>.”</p>
<p>Obviously, these certainly aren’t the hearing aids your father wore when he was
young.</p>
Sun, 12 Feb 2017 06:00:00 +0000
https://danieltakeshi.github.io/2017/02/12/these-arent-your-fathers-hearing-aids/
https://danieltakeshi.github.io/2017/02/12/these-arent-your-fathers-hearing-aids/Academics Against Immigration Executive Order<p>I just signed a petition, <a href="https://notoimmigrationban.com">Academics Against Immigration Executive Order</a> to
oppose the Trump administration’s recent executive order. You can <a href="https://www.nytimes.com/2017/01/27/us/politics/refugee-muslim-executive-order-trump.html">find the full
text here</a> along with the names of those who have signed up. (Graduate
students are in the “Other Signatories” category and may take a while to
update.) I like this petition because it clearly lists the names of people so as
to avoid claims of duplication and/or bogus signatures for anonymous petitions.
There are <em>lots</em> of academic superstars on the list, including (I’m proud to
say) my current statistics professor Michael I. Jordan and my <a href="https://danieltakeshi.github.io/2016/12/20/review-of-theoretical-statistics-stat-210a-at-berkeley/">statistics
professor William Fithian from last semester</a>.</p>
<p>The petition lists three compelling reasons to oppose the order, but let me just
chime in with some extra thoughts.</p>
<p>I understand the need to keep our country safe. But in order to do so, there has
to be a correct tradeoff in terms of security versus profiling (for lack of a
better word) and in terms of costs versus benefits.</p>
<p>On the spectrum of security, to one end are those who deny the existence of
radical Islam and the impact of religion on terrorism. On the other end are
those who would happily ban an entire religion and place the blame and burden on
millions of law-abiding people fleeing oppression. This order is far too close
to the second end.</p>
<p>In terms of costs and benefits, I find an analogy to policing useful. Mayors and
police chiefs shouldn’t be assigning their police officers uniformly throughout
cities. The police should be targeted in certain hotspots of crime as indicated
by past trends. That’s the most logical and cost-effective way to crack down on
crime.</p>
<p>Likewise, if we are serious about stopping radical Islamic terrorism, putting
a blanket ban on Muslims is like the “uniform policing strategy” and will also
cause additional problems since Muslims would (understandably!) feel unfairly
targeted. For instance, Iran is already <a href="http://www.wsj.com/articles/iran-promises-proportional-response-for-donald-trumps-immigration-ban-1485629265">promising “proportional
responses”</a>. I also have to mention that <a href="http://www.vox.com/2016/9/13/12901950/terrorism-immigrants-clothes">the odds of being killed by a
refugee terrorist are so low</a> that the amount of anxiety towards them does
not justify the cost.</p>
<p>By the way, I’m still waiting for when Saudi Arabia — the source of 15 out of
19 terrorists responsible for 9/11 — gets on the executive order list. I
guess President Trump has business dealings there? (Needless to say, that’s why
conflict of interest laws exist.)</p>
<p>I encourage American academics to take a look at this order and (hopefully) sign
the petition. I also urge our Secretary of Defense, James Mattis, to talk to
Trump and get him to rescind and substantially revise the order. While I didn’t
state this publicly to anyone, I have more respect for Mattis than any one else
in the Trump cabinet, and hopefully that will remain the case.</p>
Sat, 28 Jan 2017 20:00:00 +0000
https://danieltakeshi.github.io/2017/01/28/academics-against-immigration-executive-order/
https://danieltakeshi.github.io/2017/01/28/academics-against-immigration-executive-order/Understanding Higher Order Local Gradient Computation for Backpropagation in Deep Neural Networks<h2 id="introduction">Introduction</h2>
<p>One of the major difficulties in understanding how neural networks work is due
to the backpropagation algorithm. There are endless texts and online guides on
backpropagation, but most are useless. I read several explanations of
backpropagation when I learned about it from 2013 to 2014, but I never felt like
I <em>really</em> understood it until I <a href="https://danieltakeshi.github.io/2016/12/19/reflections-on-being-a-gsi-for-deep-neural-networks-cs-294-129-at-berkeley/">took/TA-ed the Deep Neural Networks class</a>
at Berkeley, based on the excellent Stanford CS 231n course.</p>
<p>The <a href="http://cs231n.github.io/optimization-2/">course notes from CS 231n include a tutorial</a> on how to compute
gradients for <em>local nodes in computational graphs</em>, which I think is key to
understanding backpropagation. However, the notes are mostly for the
one-dimensional case, and their main advice for extending gradient computation
to the vector or matrix case is to keep track of dimensions. That’s perfectly
fine, and in fact that was how I managed to get through the second CS 231n
assignment.</p>
<p>But this felt unsatisfying.</p>
<p>For some of the harder gradient computations, I had to test several different
ideas before passing the gradient checker, and sometimes I wasn’t even sure why
my code worked! Thus, the purpose of this post is to make sure I deeply
understand how gradient computation works.</p>
<p><em>Note: I’ve had this post in draft stage for a long time. However, I just found
out that the notes from CS 231n have been <a href="http://cs231n.stanford.edu/vecDerivs.pdf">updated with a guide</a> from Erik
Learned-Miller on taking matrix/vector derivatives. That’s worth checking out,
but fortunately, the content I provide here is mostly distinct from his
material.</em></p>
<h2 id="the-basics-computational-graphs-in-one-dimension">The Basics: Computational Graphs in One Dimension</h2>
<p>I won’t belabor the details on one-dimensional graphs since I assume the reader
has read the corresponding Stanford CS 231n guide. Another nice post is <a href="http://colah.github.io/posts/2015-08-Backprop/">from
Chris Olah’s excellent blog</a>. For my own benefit, I reviewed derivatives on
computational graphs by going through the CS 231n example with sigmoids (but
with the sigmoid computation spread out among finer-grained operations). You
can see my hand-written computations in the following image. Sorry, I have
absolutely no skill in getting this up quickly using tikz, Inkscape, or other
visualization tactics/software. Feel free to right-click and open the image in a
new tab. Warning: it’s big. (But I have to say, the iPhone 7 Plus makes <em>really
nice</em> images. I remember the good old days when we had to take our cameras to
CVS to get them developed…)</p>
<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/backprop_basics_cs231n.jpg" alt="backprop_example" /> </p>
<p><em>Another note:</em> from the image, you can see that this is from the fourth lecture
of CS 231n class. I watched that video on YouTube, which is excellent and of
high-quality. Fortunately, there are also automatic captions which are
highly accurate. (There’s <a href="https://www.reddit.com/r/MachineLearning/comments/4hqwza/andrej_karpathy_forced_to_take_down_stanford/">an archived reddit thread</a> discussing how Andrej
Karpathy had to take down the videos <a href="https://danieltakeshi.github.io/2015/02/14/harvard-and-mits-lack-of-closed-captions/">due to a related lawsuit I blogged about
earlier</a>, but I can see them just fine. Did they get back up somehow? I’ll
write more about this at a later date.)</p>
<p>When I was going through the math here, I came up with several rules to myself:</p>
<ol>
<li>
<p>There’s a lot of notation that can get confusing, so for simplicity, I always
denoted <em>inputs</em> as <script type="math/tex">f_1,f_2,\ldots</script> and <em>outputs</em> as <script type="math/tex">g_1,g_2,\ldots</script>,
though in this example, we only have one output at each step. By doing this, I
can view the <script type="math/tex">g_1</script>s as a function of the <script type="math/tex">f_i</script> terms, so the <em>local</em>
gradient turns into <script type="math/tex">\frac{\partial g_1}{\partial f_i}</script> and then I can
substitute <script type="math/tex">g_1</script> in terms of the inputs.</p>
</li>
<li>
<p>When doing backpropagation, I analyzed it <em>node-by-node</em>, and the boxes I drew
in my image contain a number which indicates the order I evaluated them. (I
skipped a few repeat blocks just as the lecture did.) Note that when
filling in my boxes, I <em>only</em> used the node and any incoming/outgoing arrows.
Also, the <script type="math/tex">f_i</script> and <script type="math/tex">g_i</script> keep getting repeated, i.e. the next step will
have <script type="math/tex">g_i</script> equal to whatever the <script type="math/tex">f_i</script> was in the previous block.</p>
</li>
<li>
<p>Always remember that when we have arrows here, the part <em>above the arrow</em>
contains the value of <script type="math/tex">f_i</script> (respectively, <script type="math/tex">g_i</script>) and <em>below the
arrow</em> we have <script type="math/tex">\frac{\partial L}{\partial f_i}</script> (respectively
<script type="math/tex">\frac{\partial L}{\partial g_i}</script>).</p>
</li>
</ol>
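<p>To make the node-by-node process above concrete, here is a minimal sketch in numpy of the CS 231n sigmoid example, with the forward pass spread out among fine-grained operations and the backward pass evaluated one node at a time in reverse order. The specific input values are arbitrary, chosen only for illustration.</p>

```python
import numpy as np

# Forward pass through the sigmoid, spread out among fine-grained
# operations, mirroring the node-by-node style above. The inputs
# (w0, w1, w2, x0, x1) are arbitrary values for illustration.
w0, w1, w2 = 2.0, -3.0, -3.0
x0, x1 = -1.0, -2.0

dot = w0 * x0 + w1 * x1 + w2   # linear combination node
neg = -dot                     # negation node
e   = np.exp(neg)              # exp node
den = 1.0 + e                  # add-one node
out = 1.0 / den                # reciprocal node (sigmoid output)

# Backward pass: one local gradient per node, multiplied by the
# incoming (upstream) derivative, working from the output backward.
dout = 1.0
dden = (-1.0 / den**2) * dout  # d(1/den)/d(den) times upstream
de   = 1.0 * dden              # d(den)/d(e) = 1
dneg = np.exp(neg) * de        # d(e)/d(neg) = exp(neg)
ddot = -1.0 * dneg             # d(neg)/d(dot) = -1

# Sanity check: the sigmoid derivative is s * (1 - s).
assert np.isclose(ddot, out * (1.0 - out))
```

<p>The final check confirms the usual shortcut: the graph-based backward pass recovers the closed-form sigmoid derivative <script type="math/tex">\sigma(x)(1-\sigma(x))</script>, even though no node ever “knew” it was part of a sigmoid.</p>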
<p>Hopefully this will be helpful to beginners using computational graphs.</p>
<h2 id="vectormatrixtensor-derivatives-with-examples">Vector/Matrix/Tensor Derivatives, With Examples</h2>
<p>Now let’s get to the big guns — vectors/matrices/tensors. Vectors are a
special case of matrices, which are a special case of tensors, the most
generalized <script type="math/tex">n</script>-dimensional array. For this section, I will continue using the
“partial derivative” notation <script type="math/tex">\frac{\partial}{\partial x}</script> to represent any
derivative form (scalar, vector, or matrix).</p>
<h3 id="relu">ReLU</h3>
<p>Our first example will be with <strong>ReLU</strong>s, because that was covered a bit in the
CS 231n lecture. Let’s suppose <script type="math/tex">x \in \mathbb{R}^3</script>, a 3-D column vector
representing some data from a hidden layer deep into the network. The ReLU
operation’s forward pass is extremely simple: <script type="math/tex">y = \max\{0,x\}</script>, which can be
vectorized using <code class="highlighter-rouge">np.maximum</code>.</p>
<p>The backward pass is where things get tricky. The input is a 3-D vector, and so
is the output! Hence, taking the derivative of the function <script type="math/tex">y(x):
\mathbb{R}^3\to \mathbb{R}^3</script> means we have to consider the effect of every
<script type="math/tex">x_i</script> on every <script type="math/tex">y_j</script>. The only way that’s possible is to use Jacobians.
Using the example here, denoting the derivative as <script type="math/tex">\frac{\partial y}{\partial
x}</script> where <script type="math/tex">y(x)</script> is a function of <script type="math/tex">x</script>, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial y}{\partial x} &=
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} &\frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3}\\
\frac{\partial y_2}{\partial x_1} &\frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3}\\
\frac{\partial y_3}{\partial x_1} &\frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3}
\end{bmatrix}\\
&=
\begin{bmatrix}
\frac{\partial}{\partial x_1}\max\{0,x_1\} &\frac{\partial}{\partial x_2}\max\{0,x_1\} & \frac{\partial}{\partial x_3}\max\{0,x_1\}\\
\frac{\partial}{\partial x_1}\max\{0,x_2\} &\frac{\partial}{\partial x_2}\max\{0,x_2\} & \frac{\partial}{\partial x_3}\max\{0,x_2\}\\
\frac{\partial}{\partial x_1}\max\{0,x_3\} &\frac{\partial}{\partial x_2}\max\{0,x_3\} & \frac{\partial}{\partial x_3}\max\{0,x_3\}
\end{bmatrix}\\
&=
\begin{bmatrix}
1\{x_1>0\} & 0 & 0 \\
0 & 1\{x_2>0\} & 0 \\
0 & 0 & 1\{x_3>0\}
\end{bmatrix}
\end{align*} %]]></script>
<p>The most interesting part of this happens when we expand the Jacobian and see
that we have a bunch of derivatives, but <em>they all evaluate to zero on the
off-diagonal</em>. After all, the effect (i.e. derivative) of <script type="math/tex">x_2</script> will be zero
for the function <script type="math/tex">\max\{0,x_3\}</script>. The diagonal term is only slightly more
complicated: an indicator function (which evaluates to either 0 or 1) depending
on the outcome of the ReLU. This means we have to <em>cache</em> the result of the
forward pass, which is easy to do in the CS 231n assignments.</p>
<p>How does this get combined with the incoming (i.e. “upstream”) gradient, which
is a <em>vector</em> <script type="math/tex">\frac{\partial L}{\partial y}</script>? We perform a matrix times
vector operation with that and our Jacobian from above. Thus, the overall
gradient we have for <script type="math/tex">x</script> with respect to the loss function, which is what we
wanted all along, is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\partial L}{\partial x} =
\begin{bmatrix}
1\{x_1>0\} & 0 & 0 \\
0 & 1\{x_2>0\} & 0 \\
0 & 0 & 1\{x_3>0\}
\end{bmatrix}
\cdot \frac{\partial L}{\partial y} %]]></script>
<p>This is as simple as doing <code class="highlighter-rouge">mask * y_grad</code> where <code class="highlighter-rouge">mask</code> is a numpy array with 0s
and 1s depending on the value of the indicator functions, and <code class="highlighter-rouge">y_grad</code> is the
upstream derivative/gradient. In other words, we can completely bypass the
Jacobian computation in our Python code! Another option is to use <code class="highlighter-rouge">y_grad[x <=
0] = 0</code>, where <code class="highlighter-rouge">x</code> is the data that was passed in the forward pass (just before
ReLU was applied). In numpy, this will set all indices to which the condition <code class="highlighter-rouge">x
<= 0</code> is true to have zero value, precisely clearing out the gradients where we
need it cleared.</p>
<p>In practice, we tend to use <em>mini-batches</em> of data, so instead of a single <script type="math/tex">x
\in \mathbb{R}^3</script>, we have a matrix <script type="math/tex">X \in \mathbb{R}^{3 \times n}</script> with
<script type="math/tex">n</script> columns.<sup id="fnref:tensor"><a href="#fn:tensor" class="footnote">1</a></sup> Denote the <script type="math/tex">i</script>th column as <script type="math/tex">x^{(i)}</script>. Writing out
the full Jacobian is too cumbersome in this case, but to visualize it, think of
having <script type="math/tex">n=2</script> and then stacking the two samples <script type="math/tex">x^{(1)},x^{(2)}</script> into a
six-dimensional vector. Do the same for the output <script type="math/tex">y^{(1)},y^{(2)}</script>. The
Jacobian turns out to again be a diagonal matrix, particularly because the
derivative of <script type="math/tex">x^{(i)}</script> on the output <script type="math/tex">y^{(j)}</script> is zero for <script type="math/tex">i \ne j</script>.
Thus, we can again use a simple masking, element-wise multiply on the upstream
gradient to compute the gradient of the loss with respect to <script type="math/tex">x</script>. In our code we
don’t have to do any “stacking/destacking”; we can actually use the exact same
code <code class="highlighter-rouge">mask * y_grad</code> with both of these being 2-D numpy arrays (i.e. matrices)
rather than 1-D numpy arrays. The case is similar for larger minibatch sizes
using <script type="math/tex">n>2</script> samples.</p>
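<p>The masking discussion above can be sketched as follows. Both options compute the same <script type="math/tex">\frac{\partial L}{\partial x}</script> for a mini-batch; the shapes (3 rows, a handful of columns) simply mirror the running example, and the random values are placeholders for actual forward-pass data and upstream gradients.</p>

```python
import numpy as np

# ReLU backward pass for a mini-batch X of shape (3, n), as described
# above. The upstream gradient Y_grad has the same shape as X.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))       # cached input from the forward pass
Y_grad = rng.standard_normal((3, 5))  # upstream dL/dY

# Option 1: build an explicit 0/1 mask, then element-wise multiply.
mask = (X > 0).astype(X.dtype)
X_grad_1 = mask * Y_grad

# Option 2: copy the upstream gradient and zero out the "dead" units.
X_grad_2 = Y_grad.copy()
X_grad_2[X <= 0] = 0

# Both bypass the (diagonal) Jacobian entirely and agree exactly.
assert np.allclose(X_grad_1, X_grad_2)
```

<p>Note that option 2 mutates its array in place, so copying the upstream gradient first (as done here) avoids clobbering a value another node might still need.</p>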
<p><em>Remark</em>: this process of computing derivatives will be similar to other
activation functions because they are <em>elementwise</em> operations.</p>
<h3 id="affine-layer-fully-connected-biases">Affine Layer (Fully Connected), Biases</h3>
<p>Now let’s discuss a layer which <em>isn’t</em> elementwise: the fully connected layer
operation <script type="math/tex">WX+b</script>. How do we compute gradients? To start, let’s consider one
3-D element <script type="math/tex">x</script> so that our operation is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
W_{11} & W_{12} & W_{13} \\
W_{21} & W_{22} & W_{23}
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1 \\
b_2
\end{bmatrix}
=
\begin{bmatrix}
y_1 \\
y_2
\end{bmatrix} %]]></script>
<p>According to the chain rule, the local gradient with respect to <script type="math/tex">b</script> is</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial b} =
\underbrace{\frac{\partial y}{\partial b}}_{2\times 2}
\cdot
\underbrace{\frac{\partial L}{\partial y}}_{2\times 1}</script>
<p>Since we’re doing backpropagation, we can assume the upstream derivative is
given, so we only need to compute the <script type="math/tex">2\times 2</script> Jacobian. To do so, observe
that</p>
<script type="math/tex; mode=display">\frac{\partial y_1}{\partial b_1} = \frac{\partial}{\partial b_1} (W_{11}x_1+W_{12}x_2+W_{13}x_3+b_1) = 1</script>
<p>and a similar case happens for the second component. The off-diagonal terms are
zero in the Jacobian since <script type="math/tex">b_i</script> has no effect on <script type="math/tex">y_j</script> for <script type="math/tex">i\ne j</script>.
Hence, the local derivative is</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\partial L}{\partial b} =
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\cdot
\frac{\partial L}{\partial y} =
\frac{\partial L}{\partial y} %]]></script>
<p>That’s pretty nice — all we need to do is copy the upstream derivative. No
additional work necessary!</p>
<p>Now let’s get more realistic. How do we extend this when <script type="math/tex">X</script> is a matrix?
Let’s continue the same notation as we did in the ReLU case, so that our columns
are <script type="math/tex">x^{(i)}</script> for <script type="math/tex">i=\{1,2,\ldots,n\}</script>. Thus, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
W_{11} & W_{12} & W_{13} \\
W_{21} & W_{22} & W_{23}
\end{bmatrix}
\begin{bmatrix}
x_1^{(1)} & \cdots & x_1^{(n)} \\
x_2^{(1)} & \cdots & x_2^{(n)} \\
x_3^{(1)} & \cdots & x_3^{(n)}
\end{bmatrix}
+
\begin{bmatrix}
b_1 & \cdots & b_1 \\
b_2 & \cdots & b_2
\end{bmatrix}
=
\begin{bmatrix}
y_1^{(1)} & \cdots & y_1^{(n)} \\
y_2^{(1)} & \cdots & y_2^{(n)}
\end{bmatrix} %]]></script>
<p><em>Remark</em>: crucially, notice that the elements of <script type="math/tex">b</script> are <em>repeated</em> across
columns.</p>
<p>How do we compute the local derivative? We can try writing out the derivative
rule as we did before:</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial b} =
\frac{\partial y}{\partial b}
\cdot
\frac{\partial L}{\partial y}</script>
<p>but the problem is that this isn’t matrix multiplication. Here, <script type="math/tex">y</script> is a
function from <script type="math/tex">\mathbb{R}^2</script> to <script type="math/tex">\mathbb{R}^{2\times n}</script>, and to evaluate
the derivative, it seems like we would need a 3-D matrix for full generality.</p>
<p>Fortunately, there’s an easier way with <em>computational graphs</em>. If you draw out
the computational graph and create nodes for <script type="math/tex">Wx^{(1)}, \ldots, Wx^{(n)}</script>, you
see that you have to write <script type="math/tex">n</script> <em>plus</em> nodes to get the output, each of which
takes in one of these <script type="math/tex">Wx^{(i)}</script> terms along with adding <script type="math/tex">b</script>. Then this
produces <script type="math/tex">y^{(i)}</script>. See my hand-drawn diagram:</p>
<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/backprop_basics2_cs231n.JPG" alt="backprop_example2" /> </p>
<p>This captures the key property of independence among the samples in <script type="math/tex">X</script>. To
compute the local gradients for <script type="math/tex">b</script>, it therefore suffices to compute the
local gradients for each of the <script type="math/tex">y^{(i)}</script> and then <em>add</em> them together. (The
rule in computational graphs is to <em>add</em> incoming derivatives, which can be
verified by looking at trivial 1-D examples.) The gradient is</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial b} =
\sum_{i=1}^n
\frac{\partial y^{(i)}}{\partial b}
\frac{\partial L}{\partial y^{(i)}}
=
\sum_{i=1}^n
\frac{\partial L}{\partial y^{(i)}}</script>
<p>See what happened? This immediately reduced to the same case we had earlier,
with a <script type="math/tex">2\times 2</script> Jacobian being multiplied by a <script type="math/tex">2\times 1</script> upstream
derivative. All of the Jacobians turn out to be the identity, meaning that the
final derivative <script type="math/tex">\frac{\partial L}{\partial b}</script> is the sum of the columns of
the original upstream derivative matrix <script type="math/tex">Y</script>. As a sanity check, this is a
<script type="math/tex">(2\times 1)</script>-dimensional vector, as desired. In numpy, one can do this with
something similar to <code class="highlighter-rouge">np.sum(Y_grad)</code>, though you’ll probably need the <code class="highlighter-rouge">axis</code>
argument to make sure the sum is across the appropriate dimension.</p>
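<p>Here is a minimal numerical sanity check of that bias gradient, under the simplifying assumption that the loss is the surrogate <script type="math/tex">L = \sum_{i,j} \left(\frac{\partial L}{\partial Y}\right)_{ij} Y_{ij}</script>, whose exact gradient with respect to <script type="math/tex">Y</script> is the given upstream matrix. The shapes mirror the running example: <script type="math/tex">W</script> is <script type="math/tex">2\times 3</script>, <script type="math/tex">X</script> is <script type="math/tex">3\times n</script>, and <script type="math/tex">b</script> is <script type="math/tex">2\times 1</script>.</p>

```python
import numpy as np

# Check that dL/db is the sum of the columns of the upstream gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
b = rng.standard_normal((2, 1))
Y_grad = rng.standard_normal((2, 4))   # upstream dL/dY

# Analytic gradient from the derivation: sum over the columns.
b_grad = np.sum(Y_grad, axis=1, keepdims=True)

# Finite-difference check with the surrogate loss L = sum(Y_grad * Y).
eps = 1e-6
b_grad_num = np.zeros_like(b)
for i in range(b.shape[0]):
    bp, bm = b.copy(), b.copy()
    bp[i] += eps
    bm[i] -= eps
    Lp = np.sum(Y_grad * (W @ X + bp))
    Lm = np.sum(Y_grad * (W @ X + bm))
    b_grad_num[i] = (Lp - Lm) / (2 * eps)

assert np.allclose(b_grad, b_grad_num, atol=1e-5)
```

<p>The <code class="highlighter-rouge">axis=1</code> argument performs the column sum, and <code class="highlighter-rouge">keepdims=True</code> preserves the <script type="math/tex">(2\times 1)</script> shape of <script type="math/tex">b</script>.</p>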
<h3 id="affine-layer-fully-connected-weight-matrix">Affine Layer (Fully Connected), Weight Matrix</h3>
<p>Going from biases, which are represented by vectors, to <em>weights</em>, which are
represented by matrices, brings some extra difficulty due to that extra
dimension.</p>
<p>Let’s focus on the case with one sample <script type="math/tex">x^{(1)}</script>. For the derivative with
respect to <script type="math/tex">W</script>, we can ignore <script type="math/tex">b</script> since the multivariate chain rule states
that the expression <script type="math/tex">y^{(1)}=Wx^{(1)}+b</script> differentiated with respect to <script type="math/tex">W</script>
causes <script type="math/tex">b</script> to disappear, just like in the scalar case.</p>
<p>The harder part is dealing with the chain rule for the <script type="math/tex">Wx^{(1)}</script> expression,
because we can’t write the expression “<script type="math/tex">\frac{\partial}{\partial W}
Wx^{(1)}</script>”. The function <script type="math/tex">Wx^{(1)}</script> is a <em>vector</em>, and the variable we’re
differentiating here is a <em>matrix</em>. Thus, we’d again need a 3-D like matrix to
contain the derivatives.</p>
<p>Fortunately, there’s an easier way with the chain rule. We can still use the
rule, except we have to <em>sum over the intermediate components</em>, as specified by
the chain rule for higher dimensions; <a href="https://en.wikipedia.org/wiki/Chain_rule">see the Wikipedia article for more
details and justification</a>. Our “intermediate component” here is the
<script type="math/tex">y^{(1)}</script> vector, which has two components. We therefore have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial L}{\partial W} &=
\sum_{i=1}^2 \frac{\partial L}{\partial y_i^{(1)}} \frac{\partial y_i^{(1)}}{\partial W} \\
&= \frac{\partial L}{\partial y_1^{(1)}}\begin{bmatrix}x_1^{(1)}&x_2^{(1)}&x_3^{(1)}\\0&0&0\end{bmatrix} +
\frac{\partial L}{\partial y_2^{(1)}}\begin{bmatrix}0&0&0\\x_1^{(1)}&x_2^{(1)}&x_3^{(1)}\end{bmatrix} \\
&= \begin{bmatrix} \frac{\partial L}{\partial y_1^{(1)}}x_1^{(1)} & \frac{\partial
L}{\partial y_1^{(1)}} x_2^{(1)}& \frac{\partial L}{\partial y_1^{(1)}} x_3^{(1)}\\ \frac{\partial
L}{\partial y_2^{(1)}} x_1^{(1)}& \frac{\partial L}{\partial y_2^{(1)}}
x_2^{(1)}& \frac{\partial L}{\partial y_2^{(1)}} x_3^{(1)}\end{bmatrix} \\
&= \begin{bmatrix} \frac{\partial L}{\partial y_1^{(1)}} \\ \frac{\partial
L}{\partial y_2^{(1)}}\end{bmatrix} \begin{bmatrix} x_1^{(1)} & x_2^{(1)} &
x_3^{(1)}\end{bmatrix}.
\end{align} %]]></script>
<p>We fortunately see that it simplifies to a simple matrix product! This seems to
suggest the following rule: try to simplify any expressions to straightforward
Jacobians, gradients, or scalar derivatives, and sum over as needed. Above,
splitting the components of <script type="math/tex">y^{(1)}</script> allowed us to utilize the derivative
<script type="math/tex">\frac{\partial y_i^{(1)}}{\partial W}</script> since <script type="math/tex">y_i^{(1)}</script> is now <em>a
real-valued function</em>, thus enabling straightforward gradient derivations. It
also meant the upstream derivative could be analyzed component-by-component,
making our lives easier.</p>
<p>A similar case holds for when we have multiple columns <script type="math/tex">x^{(i)}</script> in <script type="math/tex">X</script>. We
would have <em>another</em> sum above, over the columns, but fortunately this can be
re-written as matrix multiplication.</p>
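<p>To check that this sum over columns really collapses into a matrix product, here is a small sketch using the same surrogate-loss trick as before: with <script type="math/tex">L = \sum_{i,j} \left(\frac{\partial L}{\partial Y}\right)_{ij} Y_{ij}</script>, the derivation above predicts <script type="math/tex">\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Y} X^\top</script>, which a finite-difference check confirms.</p>

```python
import numpy as np

# Weight gradient for the batched affine layer: the column-wise sum
# from the derivation collapses into the matrix product dL/dY @ X^T.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
Y_grad = rng.standard_normal((2, 4))   # upstream dL/dY

W_grad = Y_grad @ X.T                  # shape (2, 3), same as W

# Finite-difference check with the surrogate loss L = sum(Y_grad * (W X)).
eps = 1e-6
W_grad_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        Lp = np.sum(Y_grad * (Wp @ X))
        Lm = np.sum(Y_grad * (Wm @ X))
        W_grad_num[i, j] = (Lp - Lm) / (2 * eps)

assert np.allclose(W_grad, W_grad_num, atol=1e-5)
```

<p>This one-line product is exactly what the CS 231n-style assignments expect for the fully connected backward pass, modulo transposes that depend on whether samples are stored as rows or columns.</p>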
<h3 id="convolutional-layers">Convolutional Layers</h3>
<p>How do we compute the convolutional layer gradients? That’s pretty complicated,
so I’ll leave it as an exercise for the reader. For now.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:tensor">
<p>In fact, <script type="math/tex">X</script> is in general a <em>tensor</em>. Sophisticated software
packages will generalize <script type="math/tex">X</script> to be tensors. For example, we need to add
another dimension to <script type="math/tex">X</script> with image data since we’ll be using, say,
<script type="math/tex">28\times 28</script> data instead of <script type="math/tex">28\times 1</script> data (or <script type="math/tex">3\times 1</script> data
in my trivial example here). However, for the sake of simplicity and
intuition, I will deal with simple column vectors as samples within a matrix
<script type="math/tex">X</script>. <a href="#fnref:tensor" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 21 Jan 2017 10:00:00 +0000
https://danieltakeshi.github.io/2017/01/21/understanding-higher-order-local-gradient-computation-for-backpropagation-in-deep-neural-networks/
https://danieltakeshi.github.io/2017/01/21/understanding-higher-order-local-gradient-computation-for-backpropagation-in-deep-neural-networks/Keeper of the Olympic Flame: Lake Placid’s Jack Shea vs. Avery Brundage and the Nazi Olympics (Story of My Great-Uncle)<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/jack_shea.jpeg" alt="jack_shea" /> </p>
<p>I just read <a href="http://www.spotlightnews.com/tag/keeper-of-the-olympic-flame-lake-placids-jack-shea-vs-avery-brundage-and-the-nazi-olympics/"><em>Keeper of the Olympic Flame: Lake Placid’s Jack Shea vs. Avery
Brundage and the Nazi Olympics</em></a>. This is the story of Jack Shea, a
speed-skater from Lake Placid, NY, who won two gold medals in the 1932 Winter
Olympics (coincidentally, also in Lake Placid). Shea became a local hometown
hero and helped to put Lake Placid on the map.</p>
<p>Then, a few years later, Jack Shea boycotted the 1936 Winter Olympics since they
were held in Nazi Germany. Jack Shea believed – rightfully – that any regime
that discriminated against Jews to the extent the Nazis did had no right to host
such an event. Unfortunately, the man in charge of the decision, Avery Brundage,
had the last call and decided to include Americans in the Olympics. (Due to
World War II, the Winter Olympics would not be held again until 1948.) The book
discusses Shea’s boycott – including the striking letter he wrote to Brundage
– and then moves on to the 1980 Winter Olympics, which <em>also</em> was held in Lake
Placid.</p>
<p>I enjoyed reading <em>Keeper of the Olympic Flame</em> to learn more about the history
of the Winter Olympics and the intersection of athletics and politics.</p>
<p>The book also means a lot to me because Jack Shea was my great-uncle. For me,
the pictures and stories within it are riveting. As I read the book, I often
wondered about what life must have been like in those days, particularly for my
distant relatives and ancestors.</p>
<p>I only met Jack Shea once, at a funeral for his sister (my great-aunt). Jack
Shea died in a car accident in 2002 at the age of 91, <a href="http://readme.readmedia.com/Governor-Paterson-Signs-Jack-Shea-Bill-to-Combat-Drunken-Driving/1591401/print">presumably from a
36-year-old drunk motorist <em>who escaped prosecution</em></a>. This would be just
weeks before his grandson, <a href="https://en.wikipedia.org/wiki/Jimmy_Shea">Jimmy Shea</a>, won a gold medal in skeleton during
the 2002 Salt Lake City Winter Olympics. I still remember watching Jimmy win the
gold medal and showing everyone his picture of Jack Shea in his helmet. Later, I
would personally meet Jimmy and other relatives in a post-Olympics celebration.</p>
<p>I wish I had known Jack Shea better, as he seemed like a high-character
individual. I am glad that this book is here to partially make up for that.</p>
Wed, 11 Jan 2017 21:00:00 +0000
https://danieltakeshi.github.io/2017/01/11/keeper-of-the-olympic-flame-lake-placids-jack-shea-vs-avery-brundage-and-the-nazi-olympics-story-of-my-great-uncle/
https://danieltakeshi.github.io/2017/01/11/keeper-of-the-olympic-flame-lake-placids-jack-shea-vs-avery-brundage-and-the-nazi-olympics-story-of-my-great-uncle/The End of Identity Politics?<p>On November 18, 2016, there was a fantastic essay in the New York Times by Mark
Lilla of Columbia University called “<a href="https://www.nytimes.com/2016/11/20/opinion/sunday/the-end-of-identity-liberalism.html">The End of Identity Liberalism</a>”. This
essay will go down in history as one that I will remember for a long time.</p>
<p>I have long wanted to write something about identity politics, but I could never
find the time to research and eloquently describe my beliefs on such a sensitive
topic, so a concise reaction to Lilla’s essay will have to do for now.</p>
<p>Despite being a registered Democrat, one area where I seem to disagree with
liberals — at least if we can infer anything from the 2016 election, which
admittedly is asking for a lot — is over the issue of identity politics. I
personally feel uncomfortable at best about the practice of identity politics.
I also believe that, while identity politics obviously has well-meaning
<em>intentions</em>, it accelerates the development of undesirable side effects.</p>
<p>Exhibit A: the election of Donald Trump (well, undesirable to most liberals).</p>
<p>When I was reading Lilla’s essay, the following passage hit home to me:</p>
<blockquote>
<p>[Hillary Clinton] tended on the campaign trail to lose that large vision and
slip into the rhetoric of diversity, calling out explicitly to
African-American, Latino, L.G.B.T. and women voters at every stop. This was a
strategic mistake. If you are going to mention groups in America, you had
better mention all of them. If you don’t, those left out will notice and feel
excluded. Which, as the data show, was exactly what happened with the white
working class and those with strong religious convictions. Fully two-thirds of
white voters without college degrees voted for Donald Trump, as did over 80
percent of white evangelicals.</p>
</blockquote>
<p>It is true that, if I had to pick any race to associate with the
“privileged” label, white Americans would be the easy choice. However, <a href="https://www.nytimes.com/2016/12/15/us/politics/democrats-joe-biden-hillary-clinton.html">as
suggested by Joe Biden</a>, it is difficult for the white working class to
associate themselves with privilege and with identity liberalism.</p>
<p>Aside from the working class whites, another group of people in America who I
believe “lack privilege” are people with disabilities. I also think that this
group fails to get sufficient recognition compared to other groups (relative to
population size). That is not to say that the group is <em>ignored</em>, but with
limited time and money, political parties have to selectively choose what to
promote and champion. Clinton talked about supporting disabled people in her
campaign, but this was probably more motivated from Trump’s actions than a
Democrat-led initiative to treat disabled people as a political group with
higher priority than others. (Again, it’s not about being <em>in favor of</em> or
<em>against</em>, but about <em>the priority level</em>.)</p>
<p>After thinking about it, even though I might benefit from increased “identity
politics” towards people with disabilities, I <em>still</em> would probably feel
uncomfortable taking part or engaging in the practice, given that any such focus
on a group of people necessarily leaves out others.</p>
<p>A second reason why I would feel uncomfortable is that within voting blocks, we
are seeing increased diversity. According to exit polls of the 2016 election,
Trump <a href="http://slatestarcodex.com/2016/11/16/you-are-still-crying-wolf/"><em>actually made gains</em> among African Americans, Hispanic Americans, <em>and</em>
Asian Americans compared to Mitt Romney</a>! Sometimes I worry that the focus on
labeling groups of people has the effect that Lilla observes later:</p>
<blockquote>
<p>The surprisingly high percentage of the Latino vote that went to Mr. Trump
should remind us that the longer ethnic groups are here in this country, the
more politically diverse they become.</p>
</blockquote>
<p>I wouldn’t want my beliefs pigeonholed just because of who I am. (Sadly, I have
experienced several frustrating examples of this within the deaf community.)</p>
<p>Overall, I prefer the approach where policies are designed to bring <em>everyone</em>
up together, <em>so long as</em> they are given <em>an equal starting ground</em>. I know
that this is sadly not true in America yet. Therefore, if I <em>had</em> to support any
form of identity politics, it would be one which focuses chiefly on improving
the lives of children from low-income families within the first 5-10 years of
life under the critical junction of education and nutrition.</p>
<p>I am under no illusions that part of why I did well in school is that I didn’t
grow up in poverty. And indeed, being 50% Asian and 50% Caucasian might have
helped (though Asians are often not treated as minority groups, and mixed-race
people are often ignored in polls about race, but those are subjects for another
day).</p>
<p>Ultimately, identity politics is probably not going to make or break my life
goals at this point. My impetus in raising this discussion, however, is largely
about my concern over the future of many students and old friends that I know
from high school. Some come from low-income families and face challenges in
their lives which are not prioritized by the current identity liberalism. For
instance, I was shocked to learn this year that someone I knew from high school
was arrested for <em>attempted murder</em>. I still have access to his
Facebook page, and it’s heartbreaking to read. His wife posts pictures of him
and his child and keeps asking him to “hang in there” until he can return to the
family.</p>
<p>This guy is younger than me and his future already seems dashed. His biggest
challenge in life is probably that he shares my disability, so I worry that
people similar to him will never be able to escape out of their cycle of
poverty. I hope that there is a way that we can move towards an identity-free
future without the risk of alienating people like him, and also of course, with
the effect of increasing the economic possibilities and fairness for all of us.</p>
Wed, 11 Jan 2017 20:00:00 +0000
https://danieltakeshi.github.io/2017/01/11/the-end-of-identity-politics/
https://danieltakeshi.github.io/2017/01/11/the-end-of-identity-politics/