Seita's Place

This is my blog, where I have written over 225 articles on a variety of topics, most of which are about one of two major themes. The first is computer science, which is my area of specialty as a Ph.D. student at UC Berkeley. The second can be broadly categorized as "deafness," which relates to my experience and knowledge of being deaf.
https://danieltakeshi.github.io/
Fri, 26 May 2017 20:49:42 -0700
Jekyll v3.4.3

Deep Reinforcement Learning (CS 294-112) at Berkeley, Take Two

<p>Back in Fall 2015, I took the first edition of <em>Deep Reinforcement Learning</em> (CS
294-112) at Berkeley. As usual, I <a href="https://danieltakeshi.github.io/20151217reviewofdeepreinforcementlearningcs294112atberkeley/">wrote a blog post</a> about the class; you
can find more about other classes I’ve taken by <a href="https://danieltakeshi.github.io/archive.html">searching the archives</a>.</p>
<p>In that blog post, I admitted that CS 294-112 had several weaknesses, and also
that I didn’t fully understand the material. Fast forward to today, and
I’m pleased to say that:</p>
<ul>
<li>
<p>There has been a second edition of CS 294-112, taught this past spring
semester. It was a three-credit, full-semester course and therefore more
substantive than the previous edition, which was two credits and lasted only
eight weeks. Furthermore, the slides, homework assignments, <em>and</em> the lecture
recordings are all publicly available online. Check out <a href="http://rll.berkeley.edu/deeprlcourse/">the course
website</a> for details. You can find the homework assignments <a href="https://github.com/berkeleydeeprlcourse/homework">in this GitHub
repository</a> (I had to search a bit for this).</p>
</li>
<li>
<p>I now understand much more about deep reinforcement learning and about how to
use TensorFlow.</p>
</li>
</ul>
<p>These developments go hand in hand, because I spent much of the second half of
the Spring 2017 semester self-studying the second edition of CS 294-112. (To be
clear, I was not enrolled in the class.) I know I said I would first self-study
a few other courses <a href="https://danieltakeshi.github.io/20160220thefourclassesthatihaveselfstudied/">in a previous blog post</a>, but I couldn’t pass up such a
prime opportunity to learn about deep reinforcement learning. Furthermore, the
field moves so fast that I worried that if I didn’t follow what was happening
<em>now</em>, I would <em>never</em> be able to catch up to the research frontier if I tried
to do so in a year.</p>
<p>The class had four homework assignments, and I completed all of them with the
exception of skipping the DAgger algorithm implementation in the first homework.
The assignments were extremely helpful for me to understand how to better use
TensorFlow, and I finally feel comfortable using it for my personal projects.
If I can spare the time (famous last words), I plan to write some
TensorFlow-related blog posts.</p>
<p>The video lectures were a nice bonus. I only watched a fraction of them, though.
This was in part due to time constraints, but also in part due to the lack of
captions. The lecture recordings are on YouTube, where I can turn on
automatic captions, which help me follow the material. However, some of the
videos didn’t enable that option, so I had to skip those and just read the
slides since I wasn’t following what was being said. As far as I remember,
automatic captions are provided as an option so long as whoever uploaded the
video enables some setting, so maybe someone forgot to do so? Fortunately, the
lecture video on policy gradients has captions enabled, so I was able to watch
that one. Oh, and <a href="https://danieltakeshi.github.io/2017/03/28/goingdeeperintoreinforcementlearningfundamentalsofpolicygradients/">I wrote a blog post about the material</a>.</p>
<p>Another possible downside to the course, though this one is extremely minor, is
that the last few class sessions were <em>not</em> recorded, since those were when
students presented their final projects. Maybe the students wanted some level of
privacy? Oh well, I suppose there are way too many other interesting projects
available anyway (found by searching GitHub, arXiv preprints, etc.) to worry about
this one.</p>
<p>I want to conclude with a huge thank you to the course staff. Thank you for
helping to spread knowledge about deep reinforcement learning with a great class
and with lots of publicly available material. I really appreciate it.</p>
Wed, 24 May 2017 13:00:00 -0700
https://danieltakeshi.github.io/2017/05/24/deepreinforcementlearningcs294112atberkeleytaketwo
https://danieltakeshi.github.io/2017/05/24/deepreinforcementlearningcs294112atberkeleytaketwo

Alan Turing: The Enigma

<p>I finished reading Andrew Hodges’ book <em>Alan Turing: The Enigma</em>, otherwise
known as the definitive biography of mathematician, computer scientist, and code
breaker Alan Turing. I was inspired to read the book in part because I’ve been
reading lots of AI-related books this year<sup id="fnref:reading_list"><a href="#fn:reading_list" class="footnote">1</a></sup> and in just about
every one of those books, Alan Turing is mentioned in some form. In addition, I
saw the film <em>The Imitation Game</em>, and indeed this is the book that inspired it.
I bought the 2014 edition of the book — with <em>The Imitation Game</em> cover —
during a recent visit to the <a href="https://www.nsa.gov/about/cryptologicheritage/museum/">National Cryptology Museum</a>.</p>
<p>The author is Andrew Hodges, who at that time was a mathematics instructor at
the University of Oxford (he’s now retired). He maintains a website where he
commemorates Alan Turing’s life and achievements. I encourage the interested
reader to <a href="http://www.turing.org.uk/index.html">check it out</a>. Hodges is well qualified to write such a
book, being deeply versed in mathematics. He also appears to be gay
himself.<sup id="fnref:just_saying"><a href="#fn:just_saying" class="footnote">2</a></sup></p>
<p>After reading the book, my immediate thoughts relating to the <em>positive</em> aspects
of the book are:</p>
<ul>
<li>
<p>The book is organized chronologically and the eight chapters are indicated
with date ranges. Thus, for a biography of this size, it is relatively
straightforward to piece together a mental timeline of Alan Turing’s life.</p>
</li>
<li>
<p>The book is <em>detailed</em>. Like, <em>wow</em>. The edition I have is 680 pages, not
counting the endnotes at the back of the book which command an extra 30 or so
pages. Since I read almost every word of this book (I skipped a few endnotes),
and because I tried to stay alert when reading this book, I felt like I got a
clear picture of Turing’s life, along with what life must have been like
during the World War II era.</p>
</li>
<li>
<p>The book contains quotes and writings from Turing that show just how far ahead
of his time he was. For instance, even today people are still utilizing
concepts from his famous 1936 paper <em>On Computable Numbers, with an
Application to the Entscheidungsproblem</em> and his 1950 paper <em>Computing
Machinery and Intelligence</em>. The former introduced Turing Machines; the latter
introduced the famous <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>. Fortunately, I don’t think there was
much exaggeration of Turing’s accomplishments, unlike in <em>The Imitation
Game</em>. When I was reading his quotes, I often had to remind myself that “this
is the 1940s or 1950s ….”</p>
</li>
<li>
<p>The book showcases the struggles of being gay, particularly during a time when
homosexual activity was a crime. The book actually doesn’t seem to cover some
of his struggles in the early 1950s as much as I thought it would, but it
was probably difficult to find sufficient references for this aspect of his
life. At the very least, readers today should appreciate how much our attitude
towards homosexuality has improved.</p>
</li>
</ul>
<p>That’s not to say there weren’t a few downsides. Here are some I thought of:</p>
<ul>
<li>
<p>Related to what I mentioned earlier, it is <em>long</em>. It took me a month to
finish, and the writing is in “1983 style,” which makes it more difficult for
me to understand. (By contrast, I read <em>both</em> of Richard Dawkins’ recent
autobiographies, which combine to be roughly the same length as Hodges’ book,
and Dawkins’ books were much easier to read.) Now, I find Turing’s life very
interesting so this is more of a “neutral” factor to me, but I can see why the
casual reader might be dissuaded from reading this book.</p>
</li>
<li>
<p>Much of the material is technical even to me. I understand the basics of
Turing Machines but certainly not how the early computers were built. The
hardest parts of the book to read are probably in chapters six and seven (out
of eight total). I kept asking myself, “what’s a cathode ray?”</p>
</li>
</ul>
<p>To conclude, the book is an extremely detailed overview of Turing’s life which
at times may be technically challenging to read.</p>
<p>I wonder what Alan Turing would think about AI today. The widely-used AI
undergraduate textbook by Stuart Russell and Peter Norvig concludes with the
following prescient quote by Turing:</p>
<blockquote>
<p>We can only see a short distance ahead, but we can see plenty there that needs
to be done.</p>
</blockquote>
<p>Earlier scientists have an advantage in setting their legacy in their fields
since it’s easier to make landmark contributions. I view Charles Darwin, for
instance, as the greatest biologist who has ever lived, and no matter how
skilled today’s biologists are, I believe none will ever be able to surpass
Darwin’s impact. The same goes today for Alan Turing, who (possibly along with
John von Neumann) is one of the two preeminent computer scientists who have ever
lived.</p>
<p>Despite all the talent that’s out there in computer science, I don’t think any
one individual can possibly surpass Turing’s legacy in computer science and
artificial intelligence.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:reading_list">
<p>Thus, the 2017 edition of my reading list post (<a href="https://danieltakeshi.github.io/2016/12/31/allthebooksireadin2016plusmythoughtslong">here’s the
2016 version, if you’re wondering</a>) is going to be <em>very</em> biased in terms
of AI. Stay tuned! <a href="#fnref:reading_list" class="reversefootnote">↩</a></p>
</li>
<li id="fn:just_saying">
<p>I only say this because people who are members of “certain
groups” — where membership is determined not by choice but by
intrinsic human characteristics — tend to have more knowledge about the
group than “outsiders.” Thus, a gay person by default has extra credibility
when writing about being gay than would a straight person. A deaf person by
default has extra credibility when writing about deafness than a hearing
person. And so on. <a href="#fnref:just_saying" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 21 May 2017 12:00:00 -0700
https://danieltakeshi.github.io/2017/05/21/alanturingtheenigma
https://danieltakeshi.github.io/2017/05/21/alanturingtheenigma

Understanding Deep Learning Requires Rethinking Generalization: My Thoughts and Notes

<p>The paper “Understanding Deep Learning Requires Rethinking Generalization”
(<a href="https://arxiv.org/abs/1611.03530">arXiv link</a>) caused quite a stir in the Deep Learning and Machine Learning
research communities. It’s the rare paper that seems to have high research merit
— judging from being awarded one of three <em>Best Paper</em> awards at <a href="http://www.iclr.cc/doku.php?id=ICLR2017:main&redirect=1">ICLR
2017</a> — but is <em>also</em> readable. Hence, it got the most comments
of any ICLR 2017 submission on <a href="https://openreview.net/forum?id=Sy8gdB9xx&noteId=Sy8gdB9xx">OpenReview</a>. It has also been discussed on
<a href="https://www.reddit.com/r/MachineLearning/comments/5cw3lr/r_161103530_understanding_deep_learning_requires/">reddit</a> and was recently featured on <a href="https://blog.acolyer.org/2017/05/11/understandingdeeplearningrequiresrethinkinggeneralization/"><em>The Morning Paper</em></a> blog. I was
aware of the paper shortly after it was uploaded to arXiv, but never found the
time to read it in detail until now.</p>
<p>I enjoyed reading the paper, and while I agree with many readers that some of
the findings might be obvious, the paper nonetheless seems deserving of the
attention it has been getting.</p>
<p>The authors conveniently put two of their important findings in centered
italics:</p>
<blockquote>
<p>Deep neural networks easily fit random labels.</p>
</blockquote>
<p>and</p>
<blockquote>
<p>Explicit regularization may improve generalization performance, but is neither
necessary nor by itself sufficient for controlling generalization error.</p>
</blockquote>
<p>I will also quote another contribution from the paper that I find interesting:</p>
<blockquote>
<p>We complement our empirical observations with a theoretical construction
showing that generically large neural networks can express any labeling of the
training data.</p>
</blockquote>
<p>(I go through the derivation later in this post.)</p>
<p>Going back to their first claim about deep neural networks fitting random
labels, what does this mean from a <em>generalization perspective</em>?
(Generalization is just the difference between training error and testing
error.) It means that we cannot come up with a “generalization function” that
can take in a neural network as input and output a generalization quality score.
Here’s my intuition:</p>
<ul>
<li>
<p><strong>What we want</strong>: let’s imagine an arbitrary encoding of a neural network
designed to give as much deterministic information as possible, such as the
architecture and hyperparameters, and then use that encoding as input to a
generalization function. We want that function to give us a number
representing generalization quality, assuming that the datasets are allowed to
vary. The worst case occurs when a fixed neural network gets
excellent training error but could end up either with the <em>same</em> testing error
(awesome!) or with test-set performance no better than <em>random guessing</em>
(ugh!).</p>
</li>
<li>
<p><strong>Reality</strong>: unfortunately, the best we can do seems to be no better than the
worst case. We know of no function that can provide bounds on generalization
performance across all datasets. Why? Let’s use the LeNet architecture and
MNIST as an example. With the right architecture, generalization error is very
small, as both training and testing accuracies are in the high 90s (in percent).
With a second data set that consists of the <em>same</em> MNIST digits, but with the
<em>labels randomized</em>, that same LeNet architecture can do no better than random
guessing on the test set, even though the <em>training</em> performance is extremely
good (or at least, it should be). That’s <em>literally</em> as bad as we can get.
There’s no point in developing a function to measure generalization when we
know it can only tell us that generalization will be somewhere between zero
(i.e. perfect) and the error of random guessing (i.e. the worst
case)!</p>
</li>
</ul>
<p>As they later discuss in the paper, regularization can be used to improve
generalization, but will not be sufficient for developing our desired
generalization criteria.</p>
<p>Let’s briefly take a step back and consider classical machine learning, which
provides us with generalization criteria such as VC-dimension, Rademacher
complexity, and uniform stability. I learned about VC-dimension during my
undergraduate machine learning class, Rademacher complexity during STAT 210B
this past semester, and … actually I’m not familiar with uniform stability.
But <em>intuitively</em> … it makes sense to me that classical criteria do not apply
to deep networks. To take the Rademacher complexity example: a function class
which can fit to arbitrary <script type="math/tex">\pm 1</script> noise vectors presents the trivial bound of
one, which is like saying: “generalization is between zero and the worst case.”
Not very helpful.</p>
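<p>To make that Rademacher point concrete, here is a toy numpy sketch (my own construction, not from the paper): the empirical Rademacher complexity is the expected best correlation between random signs and functions in the class, so a class that can realize <em>any</em> sign pattern on the points trivially achieves the maximum value of one.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Empirical Rademacher complexity: E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ].
# If the function class shatters the n points (it can realize any sign
# pattern), the sup is attained by f(x_i) = sigma_i on every draw.
draws = []
for _ in range(1000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    f_vals = sigma.copy()          # the class contains a perfect memorizer
    draws.append(np.mean(sigma * f_vals))
print(np.mean(draws))  # -> 1.0, the trivial (useless) bound
```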
<p>The paper then proceeds to describe their testing scenario, and packs some
important results in the figure reproduced below:</p>
<p style="textalign:center;"> <img src="https://danieltakeshi.github.io/assets/understanding_dl_rethinking_gen.png" /> </p>
<p>This figure represents a neural network classifying the images in the
widely-benchmarked CIFAR-10 dataset. The network the authors used is a
simplified version of the Inception architecture.</p>
<ul>
<li>
<p>The first subplot represents five different settings of the labels and input
images. To be clear on what the “gaussian” setting means, they use a Gaussian
distribution to generate <em>random pixels</em> (!!) for every image. The mean and
variance of that Gaussian are “matched to the original dataset.” In addition,
the “shuffled” and “random” pixels apply a random permutation to the pixels,
with the <em>same</em> permutation to all images for the former, and <em>different</em>
permutations for the latter.</p>
<p>We immediately see that the neural network can get zero training error on
<em>all</em> the settings, but the convergence speed varies. Intuition suggests that
the dataset with the correct labels and the one with the same shuffling
permutation should converge quickly, and this indeed is the case.
Interestingly enough, I thought the “gaussian” setting would have the worst
performance, but that prize seems to go to “random labels.”</p>
</li>
<li>
<p>The second subplot measures training error when the amount of label noise is
varied; with some probability <script type="math/tex">p</script>, each image independently has its label
corrupted and replaced with a draw from the discrete uniform distribution over
the classes. The results show that more corruption slows convergence, which
makes sense. By the way, using a continuum of something is a common research
tactic and something I should try for my own work.</p>
</li>
<li>
<p>Finally, the third subplot measures generalization error under label
corruption. As these data points were all measured <em>after</em> convergence, this
is equivalent to the test error. The results here also make a lot of sense.
Test-set error should approach 90 percent because CIFAR-10 has 10
classes (that’s why it’s called CIFAR-10!).</p>
</li>
</ul>
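<p>The label-corruption procedure described above is simple enough to sketch directly. Below is my own minimal numpy illustration (not the authors' code), which also confirms the 90-percent figure at full corruption:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                       # number of classes (CIFAR-10 has 10)
n = 100_000
labels = rng.integers(0, K, size=n)

def corrupt(labels, p, rng):
    """With probability p, independently replace each label with a
    uniform draw over the K classes (the corrupted label may happen
    to coincide with the true one)."""
    mask = rng.random(labels.shape) < p
    random_labels = rng.integers(0, K, size=labels.shape)
    return np.where(mask, random_labels, labels)

# At p = 1, a network that memorizes the corrupted training labels
# generalizes no better than chance: expected test error is 1 - 1/K = 90%.
corrupted = corrupt(labels, 1.0, rng)
err = np.mean(corrupted != labels)
print(err)  # close to 0.90
```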
<p>My major criticism of this figure is <em>not</em> that the results, particularly in the
second and third subplots, might seem obvious but that the figure <em>lacks error
bars</em>. Since it’s easy nowadays to program multiple calls in a bash script or
something similar, I would expect at least three trials, with error bars (or
“regions”) added to each curve in this figure.</p>
<p>The next section discusses the role of regularization, which is normally
applied to prevent overfitting to the training data. The classic example is with
linear regression and a dataset of several points arranged in roughly a linear
fashion. Do we try to fit a straight line through these points, which might have
lots of training error, or do we take a high-degree polynomial and fit
<em>every</em> point exactly, even if the resulting curve looks impossibly crazy?
That’s what regularization helps to control. Explicit regularization in linear
regression is the <script type="math/tex">\lambda</script> term in the following optimization problem:</p>
<script type="math/tex; mode=display">\min_w \Xw  y\_2^2 + \lambda \w\_2^2</script>
<p>I presented this <a href="https://danieltakeshi.github.io/2016/08/05/ausefulmatrixinverseequalityforridgeregression/">in an earlier blog post</a>.</p>
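<p>The closed-form solution makes the effect of <script type="math/tex">\lambda</script> concrete. Here is a short sketch with synthetic data (my own example): larger <script type="math/tex">\lambda</script> shrinks the learned weights relative to ordinary least squares.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_2^2 in closed form:
    w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_noreg = ridge(X, y, 0.0)     # ordinary least squares
w_reg = ridge(X, y, 10.0)      # explicit regularization shrinks the weights
print(np.linalg.norm(w_reg) < np.linalg.norm(w_noreg))  # True
```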
<p>To investigate the role of regularization in Deep Learning, the authors test
with and without regularizers. Incidentally, the use of <script type="math/tex">\lambda</script> above is not
the only type of regularization. There are also several others: <strong>data
augmentation</strong>, <strong>dropout</strong>, <strong>weight decay</strong>, <strong>early stopping</strong> (implicit) and
<strong>batch normalization</strong> (implicit). These are standard tools in the modern Deep
Learning toolkit.</p>
<p>They find that, while regularization helps to improve generalization
performance, it is still possible to get excellent generalization even with <em>no</em>
regularization. They conclude:</p>
<blockquote>
<p>In summary, our observations on both explicit and implicit regularizers are
consistently suggesting that regularizers, when properly tuned, could help to
improve the generalization performance. However, it is unlikely that the
regularizers are the fundamental reason for generalization, as the networks
continue to perform well after all the regularizers [are] removed.</p>
</blockquote>
<p>On a side note, the regularization discussion in the paper feels out of order
and the writing sounds a bit off to me. I wish they had more time to fix this,
as the regularization portion of the paper contains most of my English
languagerelated criticism.</p>
<p>Moving on, the next section of the paper is about <strong>finite-sample
expressivity</strong>, or understanding what functions neural networks can express
<em>given a finite number of samples</em>. The authors state that the previous
literature focuses on <em>population analysis</em> where one can assume an arbitrary
number of samples. Here, instead, they assume a <em>fixed</em> set of <script type="math/tex">n</script> training
points <script type="math/tex">\{x_1,\ldots,x_n\}</script>. This seems easier to understand anyway.</p>
<p>They prove a theorem that relates to the third major contribution I wrote
earlier: “that generically large neural networks can express any labeling of the
training data.” Before proving the theorem, let’s begin with the following
lemma:</p>
<blockquote>
<p><strong>Lemma 1.</strong> For any two interleaving sequences of <script type="math/tex">n</script> real numbers</p>
<script type="math/tex; mode=display">% <![CDATA[
b_1 < x_1 < b_2 < x_2 \cdots < b_n < x_n %]]></script>
<p>the <script type="math/tex">n \times n</script> matrix <script type="math/tex">A = [\max\{x_i  b_j, 0\}]_{ij}</script> has full rank.
Its smallest eigenvalue is <script type="math/tex">\min_i (x_i  b_i)</script>.</p>
</blockquote>
<p>Whenever I see statements like these, my first instinct is to draw out the
matrix. And here it is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
A &=
\begin{bmatrix}
\max\{x_1b_1, 0\} & \max\{x_1b_2, 0\} & \cdots & \max\{x_1b_n, 0\} \\
\max\{x_2b_1, 0\} & \max\{x_2b_2, 0\} & \cdots & \max\{x_2b_n, 0\} \\
\vdots & \ddots & \ddots & \vdots \\
\max\{x_nb_1, 0\} & \max\{x_nb_2, 0\} & \cdots & \max\{x_nb_n, 0\}
\end{bmatrix} \\
&\;{\overset{(i)}{=}}\;
\begin{bmatrix}
x_1b_1 & 0 & 0 & \cdots & 0 \\
x_2b_1 & x_2b_2 & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
x_{n1}b_1 & x_{n1}b_2 & \ddots & \cdots & 0 \\
x_nb_1 & x_nb_2 & x_nb_3 & \cdots & x_nb_n
\end{bmatrix}
\end{align} %]]></script>
<p>where (i) follows from the interleaving sequence assumption. This matrix is
lower-triangular, and moreover, all the nonzero elements are positive. We know
from linear algebra that lower triangular matrices</p>
<ul>
<li>are invertible if and only if the diagonal elements are nonzero</li>
<li>have their eigenvalues taken directly from the diagonal elements</li>
</ul>
<p>These two facts together prove Lemma 1. Next, we can prove:</p>
<blockquote>
<p><strong>Theorem 1</strong>. There exists a two-layer neural network with ReLU activations
and <script type="math/tex">2n + d</script> weights that can represent any function on a sample of size
<script type="math/tex">n</script> in <script type="math/tex">d</script> dimensions.</p>
</blockquote>
<p>Consider the function</p>
<script type="math/tex; mode=display">c(x) = \sum_{j=1}^n w_j \cdot \max\{a^Txb_j,0\}</script>
<p>with <script type="math/tex">w, b \in \mathbb{R}^n</script> and <script type="math/tex">a,x\in \mathbb{R}^d</script>. (There’s a typo in
the paper, <script type="math/tex">c</script> is a function from <script type="math/tex">\mathbb{R}^d\to \mathbb{R}</script>, not
<script type="math/tex">\mathbb{R}^n\to \mathbb{R}</script>). This can certainly be represented by a depth2
ReLU network. To be clear on <a href="http://cs231n.github.io/neuralnetworks1/">the naming convention</a>, “depth2” does not
count the input layer, so our network should only have one ReLU layer in it as
the output shouldn’t have ReLUs applied to it.</p>
<p>Here’s how to think of the network representing <script type="math/tex">c</script>. First, assume that we
have a <em>minibatch</em> of <script type="math/tex">n</script> elements, so that <script type="math/tex">X</script> is the <script type="math/tex">n\times d</script> data
matrix. The depth-2 network representing <script type="math/tex">c</script> can be expressed as:</p>
<script type="math/tex; mode=display">% <![CDATA[
c(X) =
\max\left(
\underbrace{\begin{bmatrix}
\texttt{} & x_1 & \texttt{} \\
\vdots & \vdots & \vdots \\
\texttt{} & x_n & \texttt{} \\
\end{bmatrix}}_{n\times d}
\underbrace{\begin{bmatrix}
\mid & & \mid \\
a & \cdots & a \\
\mid & & \mid
\end{bmatrix}}_{d \times n}

\underbrace{\begin{bmatrix}
b_1 & \cdots & b_n
\end{bmatrix}}_{1\times n}
, \;\;
\underbrace{\begin{bmatrix}
0 & \cdots & 0
\end{bmatrix}}_{1\times n}
\right)
\cdot
\begin{bmatrix}
w_1 \\ \vdots \\ w_n
\end{bmatrix} %]]></script>
<p>where <script type="math/tex">b</script> and the zerovector used in the maximum “broadcast” as necessary in
Python code.</p>
<p>Given a fixed dataset <script type="math/tex">S=\{z_1,\ldots,z_n\}</script> of distinct inputs with labels
<script type="math/tex">y_1,\ldots,y_n</script>, we must be able to find settings of <script type="math/tex">a,w,</script> and <script type="math/tex">b</script> such
that <script type="math/tex">c(z_i)=y_i</script> for all <script type="math/tex">i</script>. You might be guessing how we’re doing this:
we must reduce this to the interleaving property in Lemma 1. Due to the
uniqueness of the <script type="math/tex">z_i</script>, it is possible to find <script type="math/tex">a</script> to make the
<script type="math/tex">x_i=z_i^Ta</script> terms satisfy the interleaving property. Then we have a full rank
solution, hence <script type="math/tex">y=Aw</script> results in <script type="math/tex">w^* = A^{-1}y</script> as our final weights,
where <script type="math/tex">A</script> is precisely that matrix from Lemma 1! We also see that, indeed,
there are <script type="math/tex">n+n+d</script> weights in the network. This is an interesting and fun
proof, and I think variants of this question would work well as a homework
assignment for a Deep Learning class.</p>
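<p>The construction in this proof is concrete enough to implement directly. Below is a numpy sketch (variable names and the midpoint choice of the <script type="math/tex">b_j</script> are mine) that builds <script type="math/tex">a</script>, <script type="math/tex">b</script>, and <script type="math/tex">w</script> for random data and checks that the resulting depth-2 ReLU network fits every label exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3
Zs = rng.standard_normal((n, d))   # n distinct inputs z_i in R^d
y = rng.standard_normal(n)         # arbitrary real labels

# Choose a random direction a; with probability 1 the projections z_i^T a
# are all distinct. Sort them, then place each b_j below x_j (midpoints
# work) so that b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n.
a = rng.standard_normal(d)
order = np.argsort(Zs @ a)
x = (Zs @ a)[order]
b = np.empty(n)
b[0] = x[0] - 1.0
b[1:] = (x[:-1] + x[1:]) / 2.0
assert np.all(b < x) and np.all(x[:-1] < b[1:])   # interleaving holds

# Lemma 1: A_ij = max(x_i - b_j, 0) is lower triangular with positive
# diagonal, hence full rank, so A w = y has an exact solution.
A = np.maximum(x[:, None] - b[None, :], 0.0)
assert np.allclose(A, np.tril(A))
w = np.linalg.solve(A, y[order])

def c(xs):
    """Depth-2 ReLU net: c(x) = sum_j w_j * max(a^T x - b_j, 0),
    applied to a batch xs of shape (m, d)."""
    return np.maximum((xs @ a)[:, None] - b[None, :], 0.0) @ w

print(np.allclose(c(Zs), y))  # True: the network fits every label exactly
```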
<p>The authors conclude the paper by trying to understand generalization with
<em>linear</em> models, in the hope that some of the intuition will transfer over to
the Deep Learning setting. With linear models, given some weights <script type="math/tex">w</script>
resulting from the optimization problem, what can we say about generalization
just by looking at it? Curvature is one popular metric to understand the
<em>quality</em> of the minima (which is not necessarily the same as the generalization
criteria!), but the Hessian is independent of <script type="math/tex">w</script>, so in fact it seems
impossible to use curvature for generalization. I’m convinced this is true for
the normal mean square loss, but is this still true if the loss function were,
say, the <em>cube</em> of the <script type="math/tex">L_2</script> difference? After all, there are only two
derivatives applied on <script type="math/tex">w</script>, right?</p>
<p>The authors instead urge us to think of stochastic gradient descent rather than
curvature when trying to measure quality. Assuming that <script type="math/tex">w_0=0</script>, the
stochastic gradient descent update consists of a series of “linear combination”
updates, and hence the result is just a linear combination <em>of</em> linear
combinations <em>of</em> linear combinations … (and so forth) … which at the end of
the day, remains a linear combination. (I don’t think they need to assume
<script type="math/tex">w_0=0</script> if we can add an extra 1 to all the data points.) Consequently, they
can fit any set of labels of the data by solving a linear equation, and indeed,
they get strong performance on MNIST and CIFAR10, even <em>without</em>
regularization.</p>
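<p>The "linear combinations all the way down" argument can be sanity-checked with a toy over-parameterized linear model. This is my own sketch (not the paper's experiments): with more parameters than samples, the minimum-norm solution fits <em>any</em> labels exactly, and it lies in the span of the data rows, exactly the kind of solution SGD started at <script type="math/tex">w_0 = 0</script> produces.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                 # more parameters than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)     # *any* labels, even pure noise

# lstsq on an underdetermined system returns the minimum-norm solution
# w = X^T (X X^T)^{-1} y, which is a linear combination of the data rows.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ w, y))            # True: zero training error
alpha = np.linalg.solve(X @ X.T, y)     # coefficients in the row span
print(np.allclose(w, X.T @ alpha))      # True: w lies in the span of the rows
```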
<p>They next try to relate this to a minimum norm interpretation, though this is
not a fruitful direction because their results are worse when they try to find
minimum norm solutions. On MNIST, their best solution, using some “Gabor wavelet
transform” (what?), is twice as good as the minimum norm solution. I’m not
sure how much stock to put into this section, other than how I like their
perspective of thinking of SGD as an implicit regularizer (like batch
normalization) rather than an optimizer. The line between the categories is
blurring.</p>
<p>To conclude, from my growing experience with Deep Learning, I don’t find their
experimental results surprising. That’s not to say the paper was entirely
predictable, but think of it this way: if I were a computer vision researcher
pre-AlexNet, I would be <em>more</em> surprised reading the AlexNet paper than I am
today reading this paper. Ultimately, as I mentioned earlier, I enjoyed this
paper, and while it was predictable (that word again…) that it couldn’t offer
any <em>solutions</em>, perhaps it will be useful as a starting point to understanding
generalization in Deep Learning.</p>
Fri, 19 May 2017 01:00:00 -0700
https://danieltakeshi.github.io/2017/05/19/understandingdeeplearningrequiresrethinkinggeneralizationmythoughtsandnotes
https://danieltakeshi.github.io/2017/05/19/understandingdeeplearningrequiresrethinkinggeneralizationmythoughtsandnotes

Mathematical Tricks Commonly Used in Machine Learning and Statistics

<p>I have passionately studied various machine learning and statistical concepts
over the last few years. One thing I’ve learned from all this is that there are
many mathematical “tricks” involved, whether or not they are explicitly stated.
(In research papers, such tricks are often used without acknowledgment since it
is assumed that anyone who can benefit from reading the paper has the
mathematical maturity to fill in the details.) I thought it would be useful for
me, and hopefully for a few interested readers, to catalogue a set of the common
tricks here, and to see them applied in a few examples.</p>
<p>The following list, in alphabetical order, is a non-exhaustive set of tricks
that I’ve seen:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Integrating Probabilities into Expectations</li>
<li>Introducing an Independent Copy</li>
<li>Jensen’s Inequality</li>
<li>Law of Iterated Expectation</li>
<li>Lipschitz Functions</li>
<li>Markov’s Inequality</li>
<li>Norm Properties</li>
<li>Series Expansions (e.g. Taylor’s)</li>
<li>Stirling’s Approximation</li>
<li>Symmetrization</li>
<li>Take a Derivative</li>
<li>Union Bound</li>
<li>Variational Representations</li>
</ul>
<p>If the names are unclear or vague, the examples below should clarify. All the
tricks are used except for the law of iterated expectation, i.e.
<script type="math/tex">\mathbb{E}[\mathbb{E}[XY]] = \mathbb{E}[X]</script>. (No particular reason for that
omission; it just turns out the exercises I’m interested in didn’t require it.)</p>
<h2 id="example1maximumofnotnecessarilyindependentsubgaussians">Example 1: Maximum of (Not Necessarily Independent!) subGaussians</h2>
<p>I covered this problem in <a href="https://danieltakeshi.github.io/2017/04/22/followingprofessormichaeljordansadviceyourbrainneedsexercise">my last post here</a> so I will not repeat the
details. However, there are two extensions to that exercise which I thought would
be worth noting.</p>
<p><strong>First</strong>, to prove an upper bound for the random variable <script type="math/tex">Z =
\max_{i=1,2,\ldots,n}|X_i|</script>, it suffices to proceed as we did earlier in the
non-absolute value case, but <em>augment</em> our sub-Gaussian variables
<script type="math/tex">X_1,\ldots,X_n</script> with the set <script type="math/tex">-X_1,\ldots,-X_n</script>. It’s OK to do this because
no independence assumptions are needed. Then it turns out that an upper bound
can be derived as</p>
<script type="math/tex; mode=display">\mathbb{E}[Z] \le 2\sqrt{\sigma^2 \log n}</script>
<p>This is the same as what we had earlier, except the “2” is now outside the
square root. It’s quite intuitive.</p>
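<p>Since the bound above is easy to sanity-check numerically, here is a small Monte Carlo sketch (my own, not part of the exercise) in the convenient special case of IID standard Gaussians — recall that the bound itself needs no independence assumptions:</p>

```python
import math
import random

random.seed(0)

def mc_expected_max_abs(n, sigma=1.0, trials=5000):
    """Monte Carlo estimate of E[ max_i |X_i| ] for n IID N(0, sigma^2) draws."""
    total = 0.0
    for _ in range(trials):
        total += max(abs(random.gauss(0.0, sigma)) for _ in range(n))
    return total / trials

n, sigma = 100, 1.0
estimate = mc_expected_max_abs(n, sigma)
bound = 2.0 * math.sqrt(sigma**2 * math.log(n))  # E[Z] <= 2 sqrt(sigma^2 log n)
assert estimate <= bound
```

<p>For <script type="math/tex">n=100</script> the estimate comes in comfortably under the bound — loose, but with the right <script type="math/tex">\sqrt{\log n}</script> growth.</p>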
<p><strong>Second</strong>, consider how we can prove the following bound:</p>
<script type="math/tex; mode=display">\mathbb{P}\Big[Z \ge 2\sqrt{\sigma^2 \log n} + \delta\Big] \le 2e^{-\frac{\delta^2}{2\sigma^2}}</script>
<p>We start by applying the standard technique of multiplying by <script type="math/tex">\lambda>0</script>,
exponentiating and then applying Markov’s Inequality with our nonnegative
random variable <script type="math/tex">e^{\lambda Z}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbb{P}\left[Z \ge 2\sqrt{\sigma^2 \log n}+\delta\right] &= \mathbb{P}\left[e^{\lambda Z} \ge e^{\lambda (2\sqrt{\sigma^2 \log n} +\delta)}\right] \\
&\le \mathbb{E}[e^{\lambda Z}]e^{-\lambda 2\sqrt{\sigma^2 \log n}} e^{-\lambda \delta} \\
&{\overset{(i)}\le}\; 2n \exp\left(\frac{\lambda^2\sigma^2}{2}-\lambda\Big(\delta+ 2\sqrt{\sigma^2 \log n}\Big)\right) \\
&{\overset{(ii)}\le}\; 2n\exp\left(-\frac{1}{2\sigma^2}\Big(\delta+ 2\sqrt{\sigma^2 \log n}\Big)^2\right) \\
&= 2 \exp\left(-\frac{1}{2\sigma^2}\left[-2\sigma^2 \log n + \delta^2 + 4\delta \sqrt{\sigma^2\log n} + 4\sigma^2\log n \right]\right)
\end{align*} %]]></script>
<p>where in (i) we used a bound previously determined in our bound on
<script type="math/tex">\mathbb{E}[Z]</script> (it came out of an intermediate step), and then used the fact
that the term in the exponential is a convex quadratic to find the minimizer
value <script type="math/tex">\lambda^* = \frac{\delta+2\sqrt{\sigma^2 \log n}}{\sigma^2}</script> via
differentiation in (ii).</p>
<p>At this point, to satisfy the desired inequality, we compare terms in the
exponentials and claim that with <script type="math/tex">\delta \ge 0</script>,</p>
<script type="math/tex; mode=display">2\sigma^2 \log n + 4\delta \sqrt{\sigma^2\log n} + \delta^2 \ge \delta^2</script>
<p>This will result in our desired bound. It therefore remains to prove
this, but it reduces to checking that</p>
<script type="math/tex; mode=display">2\sigma^2 \log n + 4\delta \sqrt{\sigma^2\log n} \ge 0</script>
<p>and the left hand side is nonnegative. Hence, the desired bound holds.</p>
<p>Tricks used:</p>
<ul>
<li>Jensen’s Inequality</li>
<li>Markov’s Inequality</li>
<li>Take a Derivative</li>
<li>Union Bound</li>
</ul>
<p><strong>Comments</strong>: My earlier blog post (along with this one) shows what I mean when
I say “take a derivative.” It happens when there is an upper bound on the right
hand side and we have a free parameter <script type="math/tex">\lambda \in \mathbb{R}</script> (or <script type="math/tex">\lambda
\ge 0</script>) which we can optimize to get the <em>tightest</em> possible bound. Oftentimes,
such a <script type="math/tex">\lambda</script> is <em>explicitly introduced</em> via Markov’s Inequality, as we
have here. Just make sure to double check that when taking a derivative, you’re
getting a <em>minimum</em>, not a maximum. In addition, Markov’s Inequality can only be
applied to <em>nonnegative</em> random variables, which is why we often have to
exponentiate the terms inside a probability statement first.</p>
<p>Note the use of convexity of the exponential function. It is <em>very common</em> to
see Jensen’s inequality applied with the exponential function. Always remember
that <script type="math/tex">e^{\mathbb{E}[X]} \le \mathbb{E}[e^X]</script>!!</p>
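<p>A one-liner makes this concrete; any distribution works since the exponential is convex, and this sketch (mine, not from the exercise) uses a uniform distribution over three points to keep the arithmetic simple:</p>

```python
import math

# Jensen's inequality with exp: e^{E[X]} <= E[e^X].
xs = [-1.0, 0.0, 1.0]                          # X uniform on {-1, 0, 1}
lhs = math.exp(sum(xs) / len(xs))              # e^{E[X]} = e^0 = 1
rhs = sum(math.exp(x) for x in xs) / len(xs)   # E[e^X] = (e^{-1} + 1 + e)/3
assert lhs <= rhs
```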
<p>The procedure that I refer to as the “union bound” when I bound a maximum by a
sum isn’t exactly the canonical way of doing it, since that typically involves
probabilities, but it has a similar flavor. More formally, the <a href="https://en.wikipedia.org/wiki/Boole%27s_inequality">union bound</a>
states that</p>
<script type="math/tex; mode=display">\mathbb{P}\left[\cup_{i=1}^n A_i\right] \le \sum_{i=1}^n \mathbb{P}\left[A_i\right]</script>
<p>for countable sets of events <script type="math/tex">A_1,A_2,\ldots</script>. When we define a set of events
based on a maximum of certain variables, that’s the same as taking the union of
the individual events.</p>
<p>On a final note, be on the lookout for applications of this type whenever a
“maximum” operation is seen with something that resembles Gaussians. Sometimes
this can be a bit subtle. For instance, it’s not uncommon to use a bound of the
form above when dealing with <script type="math/tex">\mathbb{E}[\|w\|_\infty]</script>, the expectation of
the <script type="math/tex">L_\infty</script>-norm of a standard Gaussian vector. In addition, when dealing
with sparsity, often our “<script type="math/tex">n</script>” or “<script type="math/tex">d</script>” is actually something like <script type="math/tex">{d
\choose s}</script> or another combinatorics-style value. Seeing a “log” accompanied by
a square root is a good clue and may help identify such cases.</p>
<h2 id="example2boundedrandomvariablesaresubgaussian">Example 2: Bounded Random Variables are Sub-Gaussian</h2>
<p>This example is really split into two parts. The first is as follows:</p>
<blockquote>
<p>Prove that Rademacher random variables are sub-Gaussian with parameter
<script type="math/tex">\sigma = 1</script>.</p>
</blockquote>
<p>The next is:</p>
<blockquote>
<p>Prove that if <script type="math/tex">X</script> is zero-mean and has support <script type="math/tex">X \in [a,b]</script>, then <script type="math/tex">X</script>
is sub-Gaussian with parameter (at most) <script type="math/tex">\sigma = b-a</script>.</p>
</blockquote>
<p>To prove the first part, let <script type="math/tex">\varepsilon</script> be a Rademacher random variable.
For <script type="math/tex">\lambda \in \mathbb{R}</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}[e^{\lambda \varepsilon}] \;&{\overset{(i)}{=}}\; \frac{1}{2}\left(e^{\lambda} + e^{-\lambda}\right) \\
\;&{\overset{(ii)}{=}}\; \frac{1}{2}\left( \sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} + \sum_{k=0}^\infty \frac{\lambda^k}{k!}\right) \\
\;&{\overset{(iii)}{=}}\; \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \\
\;&{\overset{(iv)}{\le}}\; \sum_{k=0}^\infty \frac{\lambda^{2k}}{2^kk!} \\
\;&{\overset{(v)}{=}}\; e^{\frac{\lambda^2}{2}},
\end{align} %]]></script>
<p>and thus the claim is satisfied by the definition of a sub-Gaussian random
variable. In (i), we removed the expectation by using facts from
Rademacher random variables, in (ii) we used the series expansions of the
exponential function, in (iii) we simplified by removing the odd powers, in (iv)
we used the clever trick that <script type="math/tex">2^kk! \le (2k)!</script>, and in (v) we <em>again</em> used
the exponential function’s power series.</p>
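<p>Since <script type="math/tex">\mathbb{E}[e^{\lambda \varepsilon}] = \cosh(\lambda)</script>, the whole chain collapses to the inequality <script type="math/tex">\cosh(\lambda) \le e^{\lambda^2/2}</script>, which is easy to spot-check on a grid (a quick numerical sketch of my own):</p>

```python
import math

# The chain above says E[e^{lam * eps}] = cosh(lam) <= e^{lam^2 / 2}
# for every real lam; spot-check it on a grid of values.
for i in range(-50, 51):
    lam = i / 5.0                        # lam ranges over [-10, 10]
    mgf = math.cosh(lam)                 # (e^{lam} + e^{-lam}) / 2
    assert mgf <= math.exp(lam**2 / 2.0) + 1e-12
```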
<p>To prove the next part, observe that for any <script type="math/tex">\lambda \in \mathbb{R}</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}_{X}[e^{\lambda X}] \;&{\overset{(i)}{=}}\; \mathbb{E}_{X}\Big[e^{\lambda (X - \mathbb{E}_{X'}[X'])}\Big] \\
\;&{\overset{(ii)}{\le}}\; \mathbb{E}_{X,X'}\Big[e^{\lambda (X - X')}\Big] \\
\;&{\overset{(iii)}{=}}\; \mathbb{E}_{X,X',\varepsilon}\Big[e^{\lambda \varepsilon(X - X')}\Big] \\
\;&{\overset{(iv)}{\le}}\; \mathbb{E}_{X,X'}\Big[e^{\frac{\lambda^2 (X - X')^2}{2}}\Big] \\
\;&{\overset{(v)}{\le}}\; e^{\frac{\lambda^2(b-a)^2}{2}},
\end{align} %]]></script>
<p>which shows by definition that <script type="math/tex">X</script> is sub-Gaussian with parameter <script type="math/tex">\sigma =
b-a</script>. In (i), we cleverly introduce <em>an extra independent copy</em> <script type="math/tex">X'</script> inside
the exponent. It’s zero-mean, so we can insert it there without issues.<sup id="fnref:miller"><a href="#fn:miller" class="footnote">1</a></sup>
In (ii), we use Jensen’s inequality, and note that we can do this with respect
to just the random variable <script type="math/tex">X'</script>. (If this is confusing, just think of the
expression as a function of <script type="math/tex">X'</script> and ignore the outer expectation.) In (iii)
we apply a clever <em>symmetrization</em> trick by multiplying a Rademacher random
variable to <script type="math/tex">X-X'</script>. The reason why we can do this is that <script type="math/tex">X-X'</script> is already
symmetric about zero. Hence, inserting the Rademacher factor will maintain that
symmetry (since Rademachers are only +1 or -1). In (iv), we applied the
Rademacher sub-Gaussian bound with <script type="math/tex">X-X'</script> held fixed, and then in (v), we
finally use the fact that <script type="math/tex">X,X' \in [a,b]</script>.</p>
<p>Tricks used:</p>
<ul>
<li>Introducing an Independent Copy</li>
<li>Jensen’s Inequality</li>
<li>Series Expansions (twice!!)</li>
<li>Symmetrization</li>
</ul>
<p><strong>Comments</strong>: The first part is a classic exercise in theoretical statistics,
one which tests your ability to understand how to use the power series of
exponential functions. The first part involved converting an exponential
function to a power series, and then later doing <em>the reverse</em>. When I was doing
this problem, I found it easiest to start by stating the conclusion — that we
would have <script type="math/tex">e^{\frac{\lambda^2}{2}}</script> somehow — and then I worked backwards.
Obviously, this only works when the problem gives us the solution!</p>
<p>The next part is also “classic” in the sense that it’s often how students (such
as myself) are introduced to the symmetrization trick. The takeaway is that one
should be on the lookout for anything that seems symmetric. Or, failing that,
perhaps <em>introduce</em> symmetry by adding in an extra independent copy, as we did
above. But make sure that your random variables are zero-mean!!</p>
<h2 id="example3concentrationaroundmedianandmeans">Example 3: Concentration Around Median and Means</h2>
<p>Here’s the question:</p>
<blockquote>
<p>Given a scalar random variable <script type="math/tex">X</script>, suppose that there are positive
constants <script type="math/tex">c_1,c_2</script> such that</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-\mathbb{E}[X]| \ge t] \le c_1e^{-c_2t^2}</script>
<p>for all <script type="math/tex">t \ge 0</script>.</p>
<p>(a) Prove that <script type="math/tex">{\rm Var}(X) \le \frac{c_1}{c_2}</script></p>
<p>(b) Prove that for any median <script type="math/tex">m_X</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le c_3e^{-c_4t^2}</script>
<p>for all <script type="math/tex">t \ge 0</script>, where <script type="math/tex">c_3 = 4c_1</script> and <script type="math/tex">c_4 = \frac{c_2}{8}</script>.</p>
</blockquote>
<p>To prove the first part, note that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
{\rm Var}(X) \;&{\overset{(i)}{=}}\; \mathbb{E}\Big[|X-\mathbb{E}[X]|^2 \Big] \\
\;&{\overset{(ii)}{=}}\; 2 \int_{t=0}^\infty t \cdot \mathbb{P}[|X-\mathbb{E}[X]| \ge t]dt \\
\;&{\overset{(iii)}{\le}}\; \frac{c_2}{c_2} \int_{t=0}^\infty 2t c_1e^{-c_2t^2} dt \\
\;&{\overset{(iv)}{=}}\; \frac{c_1}{c_2},
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from the “integrating
probabilities into expectations” trick (which I will describe shortly), (iii)
follows from the provided bound, and (iv) follows from standard calculus (note
the multiplication of <script type="math/tex">c_2/c_2</script> for mathematical convenience). This proves the
first claim.</p>
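<p>As a concrete instance (my own check, not part of the exercise): a <script type="math/tex">N(0,\sigma^2)</script> variable satisfies the hypothesis with <script type="math/tex">c_1 = 2</script> and <script type="math/tex">c_2 = \frac{1}{2\sigma^2}</script> by the standard Gaussian tail bound, so part (a) predicts <script type="math/tex">{\rm Var}(X) \le 4\sigma^2</script> — loose, but correct:</p>

```python
import math
import random

random.seed(2)

# X ~ N(0, sigma^2) satisfies P[|X - E[X]| >= t] <= 2 e^{-t^2/(2 sigma^2)},
# i.e. c1 = 2 and c2 = 1/(2 sigma^2). Part (a) then gives Var(X) <= c1/c2.
sigma = 1.5
c1, c2 = 2.0, 1.0 / (2.0 * sigma**2)

samples = [random.gauss(0.0, sigma) for _ in range(50000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
assert var <= c1 / c2   # empirical var ~ 2.25, bound c1/c2 = 4 sigma^2 = 9.0
```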
<p>This second part requires some clever insights to get this to work. One way to
start is by noting that:</p>
<script type="math/tex; mode=display">\frac{1}{2} = \mathbb{P}[X \ge m_X] = \mathbb{P}\Big[X-\mathbb{E}[X] \ge
m_X-\mathbb{E}[X]\Big] \le c_1e^{-c_2(m_X-\mathbb{E}[X])^2}</script>
<p>and where the last inequality follows from the bound provided in the question.
For us to be able to apply that bound, assume without loss of generality that
<script type="math/tex">m_X \ge \mathbb{E}[X]</script>, meaning that our <script type="math/tex">t = m_X-\mathbb{E}[X]</script> term is
positive and that we can increase the probability by inserting in absolute
values. The above also shows that</p>
<script type="math/tex; mode=display">|m_X-\mathbb{E}[X]| \le \sqrt{\frac{\log(2c_1)}{c_2}}</script>
<p>We next tackle the core of the question. Starting from the left hand side of the
desired bound, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}[|X-m_X| \ge t] \;&{\overset{(i)}{=}}\; \mathbb{P}\Big[|X - \mathbb{E}[X] + \mathbb{E}[X]-m_X| \ge t\Big] \\
\;&{\overset{(ii)}{\le}}\; \mathbb{P}\Big[|X - \mathbb{E}[X]| \ge t - |\mathbb{E}[X] - m_X|\Big] \\
\;&{\overset{(iii)}{\le}}\; c_1e^{-c_2(t - |\mathbb{E}[X] - m_X|)^2}
\end{align} %]]></script>
<p>where step (i) follows from adding zero, step (ii) follows from the Triangle
Inequality, and (iii) follows from the provided bound based on the expectation.
And yes, this is supposed to work only for when <script type="math/tex">t-|\mathbb{E}[X]-m_X| > 0</script>. The
way to get around this is that we need to assume <script type="math/tex">t</script> is greater than some
quantity. After some algebra, it turns out a nice condition for us to enforce is
that <script type="math/tex">t > \sqrt{\frac{8\log(4c_1)}{c_2}}</script>, which in turn will make
<script type="math/tex">t-|\mathbb{E}[X]-m_X| > 0</script>. If <script type="math/tex">% <![CDATA[
t < \sqrt{\frac{8\log(4c_1)}{c_2}} %]]></script>, then the
desired bound is attained because</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le 1 \le 4c_1 e^{-\frac{c_2}{8}t^2}</script>
<p>a fact which can be derived through some algebra. Thus, the remainder of the
proof boils down to checking the case that when <script type="math/tex">t >
\sqrt{\frac{8\log(4c_1)}{c_2}}</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le c_1e^{-c_2(t - |\mathbb{E}[X] - m_X|)^2} \le 4c_1 e^{-\frac{c_2}{8}t^2}</script>
<p>and this is proved by analyzing roots of the quadratic and solving for <script type="math/tex">t</script>.</p>
<p>Tricks used:</p>
<ul>
<li>Integrating Probabilities into Expectations</li>
<li>Triangle Inequality</li>
</ul>
<p><strong>Comments</strong>: The trick “integrating probabilities into expectations” is one
which I only recently learned about, though one can easily find it (along with
the derivation) on the <a href="https://en.wikipedia.org/wiki/Expected_value">Wikipedia page for the expected value</a>. In
particular, note that for a positive real number <script type="math/tex">\alpha</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{E}[|X|^\alpha] = \alpha \int_{0}^\infty t^{\alpha-1}\mathbb{P}[|X| \ge t]dt</script>
<p>and in the above, I use this trick with <script type="math/tex">\alpha=2</script>. It’s quite useful to
convert between probabilities and expectations!</p>
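<p>To see the identity in action, take <script type="math/tex">\alpha = 2</script> and <script type="math/tex">X \sim {\rm Exp}(1)</script>, where <script type="math/tex">\mathbb{P}[X \ge t] = e^{-t}</script> and <script type="math/tex">\mathbb{E}[X^2] = 2</script>; a crude Riemann sum (my own sketch) recovers the expectation from the tail probabilities:</p>

```python
import math

# E[|X|^alpha] = alpha * int_0^inf t^{alpha-1} P[|X| >= t] dt, checked with
# alpha = 2 and X ~ Exponential(1): P[X >= t] = e^{-t}, E[X^2] = 2.
alpha = 2
dt = 1e-3
integral = sum(alpha * (i * dt) ** (alpha - 1) * math.exp(-i * dt) * dt
               for i in range(1, 30001))   # integrate t from 0 to 30
assert abs(integral - 2.0) < 1e-2
```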
<p>The other trick above is using the triangle inequality in a clever way. The key
is to observe that when we have something like <script type="math/tex">\mathbb{P}[X\ge Y]</script>, if we
<em>increase</em> the value of <script type="math/tex">X</script>, then we increase that probability. This is
another common trick used in proving various bounds.</p>
<p>Finally, the above also shows that when we have constants <script type="math/tex">t</script>, it pays to be
clever in how we assign those values. Then the remainder is some bruteforce
computation. I suppose it also helps to think about inserting <script type="math/tex">1/2</script>s whenever
we have a probability and a median.</p>
<h2 id="example4upperboundsforell_0balls">Example 4: Upper Bounds for <script type="math/tex">\ell_0</script> “Balls”</h2>
<p>Consider the set</p>
<script type="math/tex; mode=display">T^d(s) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \|\theta\|_2 \le 1\}</script>
<p>We often write the number of nonzeros in <script type="math/tex">\theta</script> as <script type="math/tex">\|\theta\|_0</script> like
this even though <script type="math/tex">\|\cdot\|_0</script> is not technically a norm. This exercise
consists of three parts:</p>
<blockquote>
<p>(a) Show that <script type="math/tex">\mathcal{G}(T^d(s)) = \mathbb{E}[\max_{S \in \mathcal{S}}
\|w_S\|_2]</script> where <script type="math/tex">\mathcal{S}</script> consists of all subsets <script type="math/tex">S</script> of
<script type="math/tex">\{1,2,\ldots, d\}</script> of size <script type="math/tex">s</script>, and <script type="math/tex">w_S</script> is a subvector of <script type="math/tex">w</script> (of
size <script type="math/tex">s</script>) indexed by those components. Note that by this definition, the
cardinality of <script type="math/tex">\mathcal{S}</script> is equal to <script type="math/tex">{d \choose s}</script>.</p>
<p>(b) Show that for any fixed subset <script type="math/tex">S</script> of cardinality <script type="math/tex">s</script>, we have
<script type="math/tex">\mathbb{P}[\|w_S\|_2 \ge \sqrt{s} + \delta] \le e^{-\frac{\delta^2}{2}}</script>.</p>
<p>(c) Establish the claim that <script type="math/tex">\mathcal{G}(T^d(s)) \precsim \sqrt{s \log
\left(\frac{ed}{s}\right)}</script>.</p>
</blockquote>
<p>To be clear on the notation, <script type="math/tex">\mathcal{G}(T^d(s)) =
\mathbb{E}\left[\sup_{\theta \in T^d(s)} \langle \theta, w \rangle\right]</script> and
refers to the <em>Gaussian complexity</em> of that set. It is, roughly speaking, a way
to measure the “size” of a set.</p>
<p>To prove (a), let <script type="math/tex">\theta \in T^d(s)</script> and let <script type="math/tex">S</script> indicate the support of
<script type="math/tex">\theta</script> (i.e. where its nonzeros occur). For any <script type="math/tex">w \in \mathbb{R}^d</script>
(which we later treat to be sampled from <script type="math/tex">N(0,I_d)</script>, though the immediate
analysis below does not require that fact) we have</p>
<script type="math/tex; mode=display">\langle \theta, w \rangle =
\langle \tilde{\theta}, w_S \rangle \le
\|\tilde{\theta}\|_2 \|w_S\|_2 \le
\|w_S\|_2,</script>
<p>where <script type="math/tex">\tilde{\theta}\in \mathbb{R}^s</script> refers to the vector taking only the
nonzero components from <script type="math/tex">\theta</script>. The first inequality follows from
Cauchy-Schwarz. In addition, by standard norm properties, taking <script type="math/tex">\theta =
\frac{w_S}{\|w_S\|_2} \in T^d(s)</script> results in the case when equality is
attained. The claim thus follows. (There are some technical details needed
regarding which of the maximums — over the set sizes or over the vector
selection — should come first, but I don’t think the details are critical for
me to know.)</p>
<p>For (b), we first claim that the function <script type="math/tex">f_S : \mathbb{R}^d \to \mathbb{R}</script>
defined as <script type="math/tex">f_S(w) := \|w_S\|_2</script> is Lipschitz with respect to the Euclidean
norm with Lipschitz constant <script type="math/tex">L=1</script>. To see this, observe that when <script type="math/tex">w</script> and
<script type="math/tex">w'</script> are both <script type="math/tex">d</script>-dimensional vectors, we have</p>
<script type="math/tex; mode=display">|f_S(w)-f_S(w')| =
\Big|\|w_S\|_2-\|w_S'\|_2\Big| \;{\overset{(i)}{\le}}\;
\|w_S-w_S'\|_2 \;{\overset{(ii)}{\le}}\;
\|w-w'\|_2,</script>
<p>where (i) follows from the reverse triangle inequality for normed spaces and
(ii) follows from how the vector <script type="math/tex">w_S-w_S'</script> cannot have more nonzero terms
than <script type="math/tex">w-w'</script> but must otherwise match it for indices lying in the subset <script type="math/tex">S</script>.</p>
<p>The fact that <script type="math/tex">f_S</script> is Lipschitz means that we can apply a theorem regarding
tail bounds of Lipschitz functions of Gaussian variables. The function <script type="math/tex">f_S</script>
here doesn’t <em>require</em> its input to consist of vectors with IID standard
Gaussian components, but we have to assume that the input is like that for the
purposes of the theorem/bound to follow. More formally, for all <script type="math/tex">\delta \ge 0</script>
we have</p>
<script type="math/tex; mode=display">\mathbb{P}\Big[\|w_S\|_2 \ge \sqrt{s} + \delta\Big] \;{\overset{(i)}{\le}}\;
\mathbb{P}\Big[\|w_S\|_2 \ge \mathbb{E}[\|w_S\|_2] + \delta \Big]\;{\overset{(ii)}{\le}}\;
e^{-\frac{\delta^2}{2}}</script>
<p>where (i) follows from how <script type="math/tex">\mathbb{E}[\|w_S\|_2] \le \sqrt{s}</script> and thus we
are just decreasing the threshold for the event (hence making it more likely)
and (ii) follows from the theorem, which provides an <script type="math/tex">L</script> in the denominator of
the exponential, but <script type="math/tex">L=1</script> here.</p>
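<p>This tail bound is another one that is pleasant to verify empirically (a quick simulation of my own, taking <script type="math/tex">S</script> to be the first <script type="math/tex">s</script> coordinates):</p>

```python
import math
import random

random.seed(3)

# Empirical check of P[ ||w_S||_2 >= sqrt(s) + delta ] <= e^{-delta^2/2}
# for w with IID N(0,1) entries.
s, delta, trials = 10, 1.0, 20000
count = 0
for _ in range(trials):
    norm = math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(s)))
    if norm >= math.sqrt(s) + delta:
        count += 1
assert count / trials <= math.exp(-delta**2 / 2.0)
```

<p>The empirical tail frequency sits far below the bound here; the Lipschitz concentration theorem is dimension-free, which is the remarkable part.</p>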
<p>Finally, to prove (c), we first note that the previous part’s theorem guaranteed
that the function <script type="math/tex">f_S(w) = \|w_S\|_2</script> is sub-Gaussian with parameter
<script type="math/tex">\sigma=L=1</script>. Using this, we have</p>
<script type="math/tex; mode=display">\mathcal{G}(T^d(s)) = \mathbb{E}\Big[\max_{S \in \mathcal{S}} \|w_S\|_2\Big]
\;{\overset{(i)}{\le}}\; \sqrt{2 \sigma^2 \log {d \choose s}}
\;{\overset{(ii)}{\precsim}}\; \sqrt{s \log \left(\frac{ed}{s}\right)}</script>
<p>where (i) applies the bound for a maximum over sub-Gaussian random variables
<script type="math/tex">\|w_S\|_2</script> for all the <script type="math/tex">{d\choose s}</script> sets <script type="math/tex">S \in \mathcal{S}</script> (see
Example 1 earlier), each with parameter <script type="math/tex">\sigma</script>, and (ii) applies an
approximate bound due to Stirling’s approximation and ignores the constants of
<script type="math/tex">\sqrt{2}</script> and <script type="math/tex">\sigma</script>. The careful reader will note that Example 1
required <em>zero-mean</em> sub-Gaussian random variables, but we can generally get
around this by, I believe, subtracting away a mean and then re-adding later.</p>
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Jensen’s Inequality</li>
<li>Lipschitz Functions</li>
<li>Norm Properties</li>
<li>Stirling’s Approximation</li>
<li>Triangle Inequality</li>
</ul>
<p><strong>Comments</strong>: This exercise involves a number of tricks. The fact that
<script type="math/tex">\mathbb{E}[\|w_S\|_2] \le \sqrt{s}</script> follows from how</p>
<script type="math/tex; mode=display">\mathbb{E}[\|w_S\|_2] = \mathbb{E}\Big[\sqrt{\|w_S\|_2^2}\Big] \le
\sqrt{\mathbb{E}[\|w_S\|_2^2]} = \sqrt{s}</script>
<p>due to Jensen’s inequality and how <script type="math/tex">\mathbb{E}[X^2]=1</script> for <script type="math/tex">X \sim
N(0,1)</script>. Fiddling with norms, expectations, and square roots is another common
way to utilize Jensen’s inequality (in addition to using Jensen’s inequality
with the exponential function, as explained earlier). Moreover, if you see norms
in a probabilistic bound statement, you should immediately be thinking of the
possibility of using a theorem related to Lipschitz functions.</p>
<p>The example also uses the (reverse!) triangle inequality for norms:</p>
<script type="math/tex; mode=display">\Big| \|x\|_2-\|y\|_2 \Big| \le \|x-y\|_2</script>
<p>This can come up quite often and is the non-canonical way of viewing the
triangle inequality, so watch out!</p>
<p>Finally, don’t forget the trick where we have <script type="math/tex">{d \choose s} \le
\left(\frac{ed}{s}\right)^s</script>. This comes from <a href="https://math.stackexchange.com/questions/132625/nchoosekleqleftfracenkrightk">an application of Stirling’s
approximation</a> and is seen frequently in cases involving <em>sparsity</em>, where
<script type="math/tex">s</script> components are “selected” out of <script type="math/tex">d \gg s</script> total. The maximum over a
finite set should also provide a big hint regarding the use of a sub-Gaussian
bound over maximums of (sub-Gaussian) variables.</p>
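<p>The binomial-coefficient bound itself is easy to confirm exhaustively for small cases (a quick check of my own):</p>

```python
import math

# Stirling-style bound: C(d, s) <= (e * d / s)^s for all 1 <= s <= d.
for d in range(1, 40):
    for s in range(1, d + 1):
        assert math.comb(d, s) <= (math.e * d / s) ** s
```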
<h2 id="example5gaussiancomplexityofellipsoids">Example 5: Gaussian Complexity of Ellipsoids</h2>
<blockquote>
<p>Recall that the space <script type="math/tex">\ell_2(\mathbb{N})</script> consists of all real sequences
<script type="math/tex">\{\theta_j\}_{j=1}^\infty</script> such that <script type="math/tex">% <![CDATA[
\sum_{j=1}^\infty \theta_j^2 < \infty %]]></script>. Given a strictly positive sequence <script type="math/tex">\{\mu_j\}_{j=1}^\infty \in \ell_2(\mathbb{N})</script>,
consider the associated ellipse</p>
<script type="math/tex; mode=display">\mathcal{E} := \left\{\{\theta_j\}_{j=1}^\infty \in \ell_2(\mathbb{N}) \;\Big|\;
\sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \le 1\right\}</script>
<p>(a) Prove that the Gaussian complexity satisfies the bounds</p>
<script type="math/tex; mode=display">\sqrt{\frac{2}{\pi}}\left(\sum_{j=1}^\infty \mu_j^2 \right)^{1/2} \le
\mathcal{G}(\mathcal{E}) \le \left(\sum_{j=1}^\infty \mu_j^2 \right)^{1/2}</script>
<p>(b) For a given radius <script type="math/tex">r > 0</script>, consider the truncated set</p>
<script type="math/tex; mode=display">\tilde{\mathcal{E}} := \mathcal{E} \cap \left\{\{\theta_j\}_{j=1}^\infty
\;\Big|\; \sum_{j=1}^\infty \theta_j^2 \le r^2 \right\}</script>
<p>Obtain upper and lower bounds on its Gaussian complexity that are tight up to
universal constants independent of <script type="math/tex">r</script> and <script type="math/tex">\{\mu_j\}_{j=1}^\infty</script>.</p>
</blockquote>
<p>To prove (a), we first start with the upper bound. Letting <script type="math/tex">w</script> indicate a
sequence of IID standard Gaussians <script type="math/tex">w_i</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{G}(\mathcal{E}) \;&{\overset{(i)}{=}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty w_i\theta_i \right] \\
\;&{\overset{(ii)}{=}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \frac{\theta_i}{\mu_i}w_i\mu_i \right] \\
\;&{\overset{(iii)}{\le}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}} \left(\sum_{i=1}^\infty\frac{\theta_i^2}{\mu_i^2}\right)^{1/2}\left(\sum_{i=1}^\infty w_i^2 \mu_i^2\right)^{1/2} \right] \\
\;&{\overset{(iv)}{\le}}\; \mathbb{E}_w\left[ \left(\sum_{i=1}^\infty w_i^2 \mu_i^2 \right)^{1/2} \right] \\
\;&{\overset{(v)}{\le}}\; \sqrt{\mathbb{E}_w\left[ \sum_{i=1}^\infty w_i^2 \mu_i^2 \right]} \\
\;&{\overset{(vi)}{=}}\; \left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from multiplying by one, (iii)
follows from a clever application of the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz inequality</a> for
sequences (or more generally, <a href="https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality">Hölder’s Inequality</a>), (iv) follows from the
definition of <script type="math/tex">\mathcal{E}</script>, (v) follows from Jensen’s inequality, and (vi)
follows from linearity of expectation and how <script type="math/tex">\mathbb{E}_{w_i}[w_i^2]=1</script>.</p>
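<p>In finite dimensions the Cauchy-Schwarz step is actually tight — the supremum equals <script type="math/tex">(\sum_i w_i^2\mu_i^2)^{1/2}</script> — so the Gaussian complexity can be simulated directly and compared against the upper bound (a sketch of my own, with an arbitrary truncated square-summable sequence):</p>

```python
import math
import random

random.seed(4)

# For a (truncated) ellipse, sup_{theta in E} <theta, w> = sqrt(sum w_i^2 mu_i^2)
# by Cauchy-Schwarz with equality, so G(E) = E[ sqrt(sum_i w_i^2 mu_i^2) ].
mu = [1.0 / (j + 1) for j in range(20)]    # square-summable sequence, truncated
trials = 10000
total = 0.0
for _ in range(trials):
    total += math.sqrt(sum((random.gauss(0.0, 1.0) * m) ** 2 for m in mu))
complexity = total / trials                # Monte Carlo estimate of G(E)
upper = math.sqrt(sum(m * m for m in mu))  # (sum_i mu_i^2)^{1/2}
assert complexity <= upper
```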
<p>We next prove the lower bound. First, we note a well-known result that
<script type="math/tex">\sqrt{\frac{2}{\pi}}\mathcal{R}(\mathcal{E}) \le \mathcal{G}(\mathcal{E})</script>
where <script type="math/tex">\mathcal{R}(\mathcal{E})</script> indicates the <em>Rademacher</em> complexity of the
set. Thus, our task now boils down to showing that <script type="math/tex">\mathcal{R}(\mathcal{E}) =
\left(\sum_{i=1}^\infty \mu_i^2 \right)^{1/2}</script>. Letting <script type="math/tex">\varepsilon_i</script> be
IID Rademachers, we first begin by proving the <em>upper</em> bound</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{R}(\mathcal{E}) \;&{\overset{(i)}{=}}\; \mathbb{E}_\varepsilon\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \varepsilon_i\theta_i \right] \\
\;&{\overset{(ii)}{=}}\; \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \Big|\frac{\theta_i}{\mu_i}\Big|\mu_i \\
\;&{\overset{(iii)}{\le}}\; \sup_{\theta \in \mathcal{E}} \left(\sum_{i=1}^\infty\frac{\theta_i^2}{\mu_i^2}\right)^{1/2}\left(\sum_{i=1}^\infty \mu_i^2\right)^{1/2} \\
\;&{\overset{(iv)}{\le}}\; \left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from the symmetric nature of the
class of <script type="math/tex">\theta</script> (meaning that WLOG we can pick <script type="math/tex">\varepsilon_i = 1</script> for all
<script type="math/tex">i</script>) and then multiplying by one, (iii) follows from Cauchy-Schwarz again,
and (iv) follows from the provided bound in the definition of <script type="math/tex">\mathcal{E}</script>.</p>
<p>We’re not done yet: we actually need to show <em>equality</em> for this, or at the very
least prove a <em>lower</em> bound instead of an upper bound. However, if one chooses
the valid sequence <script type="math/tex">\{\theta_j\}_{j=1}^\infty</script> such that <script type="math/tex">\theta_j =
\mu_j^2 / (\sum_{j=1}^\infty \mu_j^2)^{1/2}</script>, then equality is attained since we
get</p>
<script type="math/tex; mode=display">\frac{\sum_{i=1}^\infty \mu_i^2}{\left(\sum_{i=1}^\infty \mu_i^2\right)^{1/2}} =
\left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}</script>
<p>in one of our steps above. This proves part (a).</p>
<p>For part (b), we construct two ellipses, one that contains
<script type="math/tex">\tilde{\mathcal{E}}</script> and one which is contained inside it. Let <script type="math/tex">m_i :=
\min\{\mu_i, r\}</script>. Then we claim that the ellipse <script type="math/tex">\mathcal{E}_{m}</script> defined
out of this sequence (i.e. treating “<script type="math/tex">m</script>” as our “<script type="math/tex">\mu</script>”) will be contained
in <script type="math/tex">\tilde{\mathcal{E}}</script>. We moreover claim that the ellipse
<script type="math/tex">\mathcal{E}^{m}</script> defined out of the sequence <script type="math/tex">\sqrt{2} \cdot m_i</script> for all
<script type="math/tex">i</script> contains <script type="math/tex">\tilde{\mathcal{E}}</script>, i.e. <script type="math/tex">\mathcal{E}_m \subset
\tilde{\mathcal{E}} \subset \mathcal{E}^m</script>. If this is true, it then follows
that</p>
<script type="math/tex; mode=display">\mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m)</script>
<p>because the definition of Gaussian complexity requires taking a maximum of
<script type="math/tex">\theta</script> over a set, and if the set grows larger via set containment, then the
Gaussian complexity can only grow larger. In addition, the fact that the upper
and lower bounds are related by a constant <script type="math/tex">\sqrt{2}</script> suggests that there
should be extra lower and upper bounds utilizing universal constants independent
of <script type="math/tex">r</script> and <script type="math/tex">\mu</script>.</p>
<p>Let us prove the two set inclusions previously described, as well as develop the
desired upper and lower bounds. Suppose <script type="math/tex">\{\theta_j\}_{j=1}^\infty \in
\mathcal{E}_m</script>. Then we have</p>
<script type="math/tex; mode=display">\sum_{i=1}^\infty \frac{\theta_i^2}{r^2} \le \sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{r,\mu_i\})^2} \le 1</script>
<p>and</p>
<script type="math/tex; mode=display">\sum_{i=1}^\infty \frac{\theta_i^2}{\mu_i^2} \le \sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{r,\mu_i\})^2} \le 1</script>
<p>In both cases, the first inequality is because we can only decrease the value in
the denominator.<sup id="fnref:downstairs"><a href="#fn:downstairs" class="footnote">2</a></sup> The last inequality follows by assumption of
membership in <script type="math/tex">\mathcal{E}_m</script>. Both requirements for membership in
<script type="math/tex">\tilde{\mathcal{E}}</script> are satisfied, and therefore,
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \mathcal{E}_m</script> implies
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \tilde{\mathcal{E}}</script> and thus the first set
containment. Moving on to the second set containment, suppose
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \tilde{\mathcal{E}}</script>. We have</p>
<script type="math/tex; mode=display">\frac{1}{2}\sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{\mu_i,r\})^2}
\;{\overset{(i)}{\le}}\;
\frac{1}{2}\left( \sum_{i=1}^\infty \frac{\theta_i^2}{r^2}+\sum_{i=1}^\infty
\frac{\theta_i^2}{\mu_i^2}\right)
\;{\overset{(ii)}{\le}}\; 1</script>
<p>where (i) follows from a “union bound”-style argument, which to be clear,
happens because for every term <script type="math/tex">i</script> in the summation, we have either
<script type="math/tex">\frac{\theta_i^2}{r^2}</script> or <script type="math/tex">\frac{\theta_i^2}{\mu_i^2}</script> added to the
summation (both positive quantities). Thus, to make the value <em>larger</em>, just add
<em>both</em> terms! Step (ii) follows from the assumption of membership in
<script type="math/tex">\tilde{\mathcal{E}}</script>. Thus, we conclude that <script type="math/tex">\{\theta_j\}_{j=1}^\infty \in
\mathcal{E}^m</script>, and we have proved that</p>
<script type="math/tex; mode=display">\mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m)</script>
<p>The final step of this exercise is to develop a lower bound on the left hand
side and an upper bound on the right hand side that are close up to universal
constants. But we have reduced this to an instance of part (a)! Thus, we simply
apply the lower bound for <script type="math/tex">\mathcal{G}(\mathcal{E}_m)</script> and the upper bound for
<script type="math/tex">\mathcal{G}(\mathcal{E}^m)</script> and obtain</p>
<script type="math/tex; mode=display">\sqrt{\frac{2}{\pi}}\left(\sum_{i=1}^\infty m_i^2 \right)^{1/2}
\le \mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m) \le
\sqrt{2}\left(\sum_{i=1}^\infty m_i^2 \right)^{1/2}</script>
<p>as our final bounds on <script type="math/tex">\mathcal{G}(\tilde{\mathcal{E}})</script>. (As a
sanity check, the ratio of the lower bound to the upper bound is <script type="math/tex">\sqrt{1/\pi} \approx 0.56</script>,
which is indeed less than one.) This proves part (b).</p>
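<p>As a quick numerical sanity check (my own sketch, not part of the exercise): for an ellipse with semi-axes <script type="math/tex">m_i</script>, Cauchy-Schwarz gives the supremum in the definition of Gaussian complexity in closed form, so we can estimate the complexity by Monte Carlo and confirm it lands between the two bounds. The radii below are hypothetical.</p>

```python
import numpy as np

def gaussian_complexity_ellipse(m, num_samples=200_000, seed=0):
    """Monte Carlo estimate of G(E) for E = {theta : sum_i theta_i^2 / m_i^2 <= 1}.

    By Cauchy-Schwarz, sup_{theta in E} <g, theta> = sqrt(sum_i g_i^2 m_i^2),
    so G(E) = E[ sqrt(sum_i g_i^2 m_i^2) ] for a standard Gaussian vector g.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((num_samples, len(m)))
    return np.sqrt((g ** 2 * m ** 2).sum(axis=1)).mean()

# Hypothetical decaying radii mu_i, truncated at r as in the exercise.
mu = 1.0 / np.arange(1, 21)
r = 0.3
m = np.minimum(r, mu)                           # m_i = min{r, mu_i}

G_est = gaussian_complexity_ellipse(m)
lower = np.sqrt(2 / np.pi) * np.linalg.norm(m)  # part (a) lower bound
upper = np.linalg.norm(m)                       # Jensen upper bound
```

With these radii the estimate falls strictly between the two bounds, consistent with the sandwich derived above.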
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Jensen’s Inequality</li>
<li>Union Bound</li>
</ul>
<p><strong>Comments</strong>: This exercise on the surface looks extremely challenging. How does
one reason about multiple infinite sequences, which furthermore may or may not
involve squared terms? I believe the key to tackling these problems is to
understand how to apply Cauchy-Schwarz (or more generally, Hölder’s Inequality)
for infinite sequences. More precisely, Hölder’s Inequality for sequence
spaces states that</p>
<script type="math/tex; mode=display">\sum_{k=1}^\infty x_ky_k \le \left(\sum_{k=1}^\infty x_k^2 \right)^{1/2}\left( \sum_{k=1}^\infty y_k^2 \right)^{1/2}</script>
<p>(It’s actually more general than this, since we can use any exponents
<script type="math/tex">p, q \ge 1</script> so long as <script type="math/tex">1/p + 1/q=1</script>, but the easiest case to
understand is when <script type="math/tex">p=q=2</script>.)</p>
<p>Hölder’s Inequality is <em>enormously helpful</em> when dealing with sums (whether
infinite or not), and <em>especially</em> when dealing with two sums if one does <em>not</em>
square its terms, but the other one <em>does</em>.</p>
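<p>Here is a tiny numerical illustration (mine, not from any exercise) of Hölder’s Inequality on finite truncations of nonnegative sequences, both in the Cauchy-Schwarz case and for another pair of conjugate exponents:</p>

```python
import numpy as np

def holder(x, y, p, q):
    """Return (LHS, RHS) of Holder's inequality: sum x_k y_k <= ||x||_p ||y||_q."""
    assert abs(1 / p + 1 / q - 1) < 1e-12, "exponents must be conjugate"
    lhs = np.sum(x * y)
    rhs = np.sum(x ** p) ** (1 / p) * np.sum(y ** q) ** (1 / q)
    return lhs, rhs

rng = np.random.default_rng(1)
x = rng.random(1000)   # nonnegative sequences, truncated to finite length
y = rng.random(1000)

lhs2, rhs2 = holder(x, y, 2.0, 2.0)   # the Cauchy-Schwarz case p = q = 2
lhs3, rhs3 = holder(x, y, 3.0, 1.5)   # another conjugate pair: 1/3 + 2/3 = 1
```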
<p>Finally, again, think about Jensen’s inequality whenever we have expectations
and a square root!</p>
<h2 id="example6pairwiseincoherence">Example 6: Pairwise Incoherence</h2>
<blockquote>
<p>Given a matrix <script type="math/tex">X \in \mathbb{R}^{n \times d}</script>, suppose it has normalized
columns (<script type="math/tex">\|X_j\|_2/\sqrt{n} = 1</script> for all <script type="math/tex">j = 1,\ldots,d</script>) and pairwise
incoherence upper bounded as <script type="math/tex">% <![CDATA[
\delta_{\rm PW}(X) < \gamma/s %]]></script>.</p>
<p>(a) Let <script type="math/tex">S \subset \{1,2,\ldots,d\}</script> be any subset of size <script type="math/tex">s</script>. Show that
there is a function <script type="math/tex">\gamma \to c(\gamma)</script> such that <script type="math/tex">\lambda_{\rm
min}\left(\frac{X_S^TX_S}{n}\right) \ge c(\gamma) > 0</script> as long as <script type="math/tex">\gamma</script>
is sufficiently small, where <script type="math/tex">X_S</script> is the <script type="math/tex">n\times s</script> matrix formed by
extracting the <script type="math/tex">s</script> columns of <script type="math/tex">X</script> whose indices are in <script type="math/tex">S</script>.</p>
<p>(b) Prove, from first principles, that <script type="math/tex">X</script> satisfies the restricted
nullspace property with respect to <script type="math/tex">S</script> as long as <script type="math/tex">% <![CDATA[
\gamma < 1/3 %]]></script>.</p>
</blockquote>
<p>To clarify, the <em>pairwise incoherence</em> of a matrix <script type="math/tex">X \in \mathbb{R}^{n \times
d}</script> is defined as</p>
<script type="math/tex; mode=display">\delta_{\rm PW}(X) := \max_{j,k = 1,2,\ldots, d}
\left|\frac{\langle X_j, X_k \rangle}{n} - \mathbb{I}[j = k]\right|</script>
<p>where <script type="math/tex">X_i</script> denotes the <script type="math/tex">i</script>th <em>column</em> of <script type="math/tex">X</script>. Intuitively, it measures
the correlation between any two columns, though it subtracts an indicator at the end
so that the maximal case does not always correspond to the case when <script type="math/tex">j=k</script>. In
addition, the matrix <script type="math/tex">\frac{X_S^TX_S}{n}</script> as defined in the problem looks like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{X_S^TX_S}{n} =
\begin{bmatrix}
\frac{(X_S)_1^T(X_S)_1}{n} & \frac{(X_S)_1^T(X_S)_2}{n} & \cdots & \frac{(X_S)_1^T(X_S)_s}{n} \\
\frac{(X_S)_1^T(X_S)_2}{n} & \frac{(X_S)_2^T(X_S)_2}{n} & \cdots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
\frac{(X_S)_1^T(X_S)_s}{n} & \cdots & \cdots & \frac{(X_S)_s^T(X_S)_s}{n} \\
\end{bmatrix} =
\begin{bmatrix}
1 & \frac{(X_S)_1^T(X_S)_2}{n} & \cdots & \frac{(X_S)_1^T(X_S)_s}{n} \\
\frac{(X_S)_1^T(X_S)_2}{n} & 1 & \cdots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
\frac{(X_S)_1^T(X_S)_s}{n} & \cdots & \cdots & 1 \\
\end{bmatrix} %]]></script>
<p>where the 1s in the diagonal are due to the assumption of having normalized columns.</p>
<p>First, we prove part (a). Starting from the <em>variational representation</em> of the
minimum eigenvalue, we consider any possible <script type="math/tex">v \in \mathbb{R}^s</script> with
Euclidean norm one (and thus this analysis will apply for the <em>minimizer</em>
<script type="math/tex">v^*</script> which induces the minimum eigenvalue) and observe that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
v^T\frac{X_S^TX_S}{n}v \;&{\overset{(i)}{=}}\; \sum_{i=1}^sv_i^2 + 2\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j \\
\;&{\overset{(ii)}{=}}\; 1 + 2\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j \\
\;&{\overset{(iii)}{\ge}}\; 1 - 2\frac{\gamma}{s}\sum_{i<j}^s|v_i||v_j| \\
\;&{\overset{(iv)}{=}}\; 1 - \frac{\gamma}{s}\left((|v_1| + \cdots + |v_s|)^2-\sum_{i=1}^sv_i^2\right) \\
\;&{\overset{(v)}{\ge}}\; 1 - \frac{\gamma}{s}\Big(s\|v\|_2^2-\|v\|_2^2\Big)
\end{align} %]]></script>
<p>where (i) follows from the definition of a quadratic form (less formally, by
matrix multiplication), (ii) follows from the <script type="math/tex">\|v\|_2 = 1</script> assumption, (iii)
follows from noting that</p>
<script type="math/tex; mode=display">% <![CDATA[
\left|\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j\right| \le \frac{\gamma}{s}\sum_{i<j}^s|v_i||v_j| %]]></script>
<p>which in turn follows from the pairwise incoherence assumption that
<script type="math/tex">\Big|\frac{(X_S)_i^T(X_S)_j}{n}\Big| \le \frac{\gamma}{s}</script>. Step (iv) follows
from definition, and (v) follows from how <script type="math/tex">\|v\|_1 \le \sqrt{s}\|v\|_2</script> for
<script type="math/tex">s</script>-dimensional vectors.</p>
<p>The above applies for any satisfactory <script type="math/tex">v</script>. Putting together the pieces, we
conclude that</p>
<script type="math/tex; mode=display">\lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) = \inf_{\|v\|_2=1} v^T\frac{X_S^TX_S}{n}v
\ge \underbrace{1 - \gamma \frac{s-1}{s}}_{c(\gamma)} \ge 1-\gamma,</script>
<p>which is strictly positive as long as <script type="math/tex">\gamma</script> is sufficiently small (any
<script type="math/tex">\gamma</script> below one suffices), proving part (a).</p>
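<p>A quick numerical check of part (a) (my own sketch, with arbitrary dimensions): given any design with normalized columns, set <script type="math/tex">\gamma := s \cdot \delta_{\rm PW}(X)</script>; then the eigenvalue bound <script type="math/tex">c(\gamma)</script> holds deterministically, since it is exactly what the Gershgorin Circle Theorem gives for the Gram matrix.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 4000, 50, 10    # arbitrary sizes, n large so the incoherence is small

# Random design, columns rescaled so that ||X_j||_2 / sqrt(n) = 1.
X = rng.standard_normal((n, d))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

gram = X.T @ X / n
delta_pw = np.abs(gram - np.eye(d)).max()   # pairwise incoherence delta_PW(X)
gamma = s * delta_pw                        # so that delta_PW(X) = gamma / s

S = rng.choice(d, size=s, replace=False)    # a random subset of s columns
lam_min = np.linalg.eigvalsh(gram[np.ix_(S, S)]).min()
bound = 1 - gamma * (s - 1) / s             # c(gamma) from part (a)
```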
<p>To prove the restricted <a href="https://en.wikipedia.org/wiki/Nullspace_property">nullspace property</a> in (b), we first suppose that
<script type="math/tex">\theta \in \mathbb{R}^d</script> and <script type="math/tex">\theta \in {\rm null}(X) \setminus \{0\}</script>.
Define <script type="math/tex">d</script>dimensional vectors <script type="math/tex">\tilde{\theta}_S</script> and
<script type="math/tex">\tilde{\theta}_{S^c}</script> which match components of <script type="math/tex">\theta</script> for the indices
within their respective sets <script type="math/tex">S</script> or <script type="math/tex">S^c</script>, and which are zero
otherwise.<sup id="fnref:time"><a href="#fn:time" class="footnote">3</a></sup> Supposing that <script type="math/tex">S</script> corresponds to the subset of indices of
<script type="math/tex">\theta</script> of the <script type="math/tex">s</script> largest elements in absolute value, it suffices to show
that <script type="math/tex">\|\tilde{\theta}_{S^c}\|_1 > \|\tilde{\theta}_S\|_1</script>, because then no
nonzero <script type="math/tex">\theta</script> in the kernel can satisfy <script type="math/tex">\|\tilde{\theta}_S\|_1 \ge \|\tilde{\theta}_{S^c}\|_1</script> (and thus the restricted nullspace property
holds).</p>
<p>We first show a few facts which we then piece together to get the final result.
The first is that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
0 \;&{\overset{(i)}{=}}\; \|X\theta \|_2^2 \\
\;&{\overset{(ii)}{=}}\; \|X\tilde{\theta}_S + X\tilde{\theta}_{S^c}\|_2^2 \\
\;&{\overset{(iii)}{=}}\; \|X\tilde{\theta}_S\|_2^2 + \|X\tilde{\theta}_{S^c}\|_2^2 + 2\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\\
\;&{\overset{(iv)}{\ge}}\; n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) - 2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|
\end{align} %]]></script>
<p>where (i) follows from the assumption that <script type="math/tex">\theta</script> is in the kernel of <script type="math/tex">X</script>,
(ii) follows from how <script type="math/tex">\theta = \tilde{\theta}_S + \tilde{\theta}_{S^c}</script>,
(iii) follows from expanding the term, and (iv) follows from carefully noting
that</p>
<script type="math/tex; mode=display">\lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) = \min_{v \in \mathbb{R}^s \setminus \{0\}}
\frac{v^T\frac{X_S^TX_S}{n}v}{v^Tv} \le
\frac{\theta_S^T\frac{X_S^TX_S}{n}\theta_S}{\|\theta_S\|_2^2}</script>
<p>where in the inequality, we have simply chosen <script type="math/tex">\theta_S</script> as our <script type="math/tex">v</script>, which
can only make the bound worse. Then step (iv) follows immediately. Don’t forget
that <script type="math/tex">\|\theta_S\|_2^2 = \|\tilde{\theta}_S\|_2^2</script>, because the latter
involves a vector that (while longer) only has extra zeros. Incidentally, the
above uses the variational representation for eigenvalues in a way that’s more
convenient if we don’t want to restrict our vectors to have Euclidean norm one.</p>
<p>We conclude from the above that</p>
<script type="math/tex; mode=display">n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) \le 2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|</script>
<p>Next, let us upper bound the RHS. We see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|\;&{\overset{(i)}{=}}\; \Big|\theta_S^T(X_S^TX_{S^c})\theta_{S^c}\Big|\\
\;&{\overset{(ii)}{=}}\; \left| \sum_{i\in S, j\in S^c} X_i^TX_j (\tilde{\theta}_S)_i(\tilde{\theta}_{S^c})_j \right| \\
\;&{\overset{(iii)}{\le}}\; \frac{n\gamma}{s} \sum_{i\in S, j\in S^c} |(\tilde{\theta}_S)_i||(\tilde{\theta}_{S^c})_j| \\
\;&{\overset{(iv)}{=}}\; \frac{n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1
\end{align} %]]></script>
<p>where (i) follows from a little thought about how matrix multiplication and
quadratic forms work. In particular, if we expanded out the LHS, we would get a
sum with lots of terms that are zero since <script type="math/tex">(\tilde{\theta}_S)_i</script> or
<script type="math/tex">(\tilde{\theta}_{S^c})_j</script> would cancel them out. (To be clear, <script type="math/tex">\theta_S \in
\mathbb{R}^s</script> and <script type="math/tex">\theta_{S^c} \in \mathbb{R}^{d-s}</script>.) Step (ii) follows
from definition, step (iii) follows from the provided Pairwise Incoherence bound
(note the need to multiply by <script type="math/tex">n/n</script>), and step (iv) follows from how</p>
<script type="math/tex; mode=display">\|\theta_S\|_1\|\theta_{S^c}\|_1 = \Big(|(\theta_S)_1| +\cdots+ |(\theta_S)_s|\Big)
\Big(|(\theta_{S^c})_1| +\cdots+ |(\theta_{S^c})_{d-s}|\Big)</script>
<p>and thus it is clear that the product of the <script type="math/tex">L_1</script> norms consists of the sum
over all possible combinations of indices with nonzero values.</p>
<p>The last thing we note is that from part (a), if we assumed that <script type="math/tex">\gamma \le
1/3</script>, then a lower bound on <script type="math/tex">\lambda_{\rm min}
\left(\frac{X_S^TX_S}{n}\right)</script> is <script type="math/tex">2/3</script>. Putting the pieces together, we
get the following three inequalities</p>
<script type="math/tex; mode=display">\frac{2n\|\theta_S\|_2^2}{3} \;\;\le \;\;
n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) \;\;\le \;\;
2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big| \;\; \le \;\;
\frac{2n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1</script>
<p>We can provide a lower bound for the first term above. Using the fact that
<script type="math/tex">\|\theta_S\|_1^2 \le s\|\theta_S\|_2^2</script>, we get
<script type="math/tex">\frac{2n\|\theta_S\|_1^2}{3s} \le \frac{2n\|\theta_S\|_2^2}{3}</script>. The final
step is to tie the lower bound here with the upper bound from the set of three
inequalities above. This results in</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{2n\|\theta_S\|_1^2}{3s} \le \frac{2n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1 \quad &\iff \quad
\frac{\|\theta_S\|_1^2}{3} \le \gamma \|\theta_S\|_1\|\theta_{S^c}\|_1 \\
&\iff \quad \|\theta_S\|_1 \le 3\gamma \|\theta_{S^c}\|_1
\end{align} %]]></script>
<p>Under the same assumption as earlier (that <script type="math/tex">% <![CDATA[
\gamma < 1/3 %]]></script>), it follows directly
that <script type="math/tex">% <![CDATA[
\|\theta_S\|_1 < \|\theta_{S^c}\|_1 %]]></script>, as claimed. Whew!</p>
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Norm Properties</li>
<li>Variational Representation (of eigenvalues)</li>
</ul>
<p><strong>Comments</strong>: Actually, for part (a), one can prove this more directly by using
the <a href="https://en.wikipedia.org/wiki/Gershgorin_circle_theorem">Gershgorin Circle Theorem</a>, a <em>very</em> useful theorem with a surprisingly
simple proof. But I chose this way above so that we can make use of the
variational representation for eigenvalues. There are also variational
representations for <em>singular values</em>.</p>
<p>The above uses a <em>lot</em> of norm properties. One example was the use of <script type="math/tex">\|v\|_1
\le \sqrt{s}\|v\|_2</script>, which can be proved via Cauchy-Schwarz. A related fact is
that <script type="math/tex">\|v\|_2 \le \sqrt{s}\|v\|_\infty</script>. These are quite handy.
Another example, which is useful when dealing with specific subsets, is to
understand how the <script type="math/tex">L_1</script> and <script type="math/tex">L_2</script> norms behave. Admittedly, getting all the
steps right for part (b) takes a <em>lot</em> of hassle and attention to details, but
it is certainly satisfying to see it work.</p>
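<p>The norm inequalities mentioned here are easy to spot-check numerically (a throwaway sketch of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
s = 10
v = rng.standard_normal(s)

l1 = np.abs(v).sum()        # ||v||_1
l2 = np.linalg.norm(v)      # ||v||_2
linf = np.abs(v).max()      # ||v||_inf

# ||v||_1 <= sqrt(s) ||v||_2 follows from Cauchy-Schwarz against the all-ones
# vector; ||v||_2 <= sqrt(s) ||v||_inf follows by bounding each coordinate.
```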
<h2 id="closingthoughts">Closing Thoughts</h2>
<p>I hope this post serves as a useful reference for me and to anyone else who
might need to use one of these tricks to understand some machine learning and
statisticsrelated math.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:miller">
<p>One of my undergraduate mathematics professors, <a href="https://web.williams.edu/Mathematics/sjmiller/public_html/williams/welcome.html">Steven J.
Miller</a>, would love this trick, as his two favorite tricks in mathematics
are <em>adding zero</em> and, of course, <em>multiplying by one</em>. <a href="#fnref:miller" class="reversefootnote">↩</a></p>
</li>
<li id="fn:downstairs">
<p>Or “downstairs” as Professor <a href="https://people.eecs.berkeley.edu/~jordan/">Michael I. Jordan</a> often puts it
(and obviously, “upstairs” for the numerator). <a href="#fnref:downstairs" class="reversefootnote">↩</a></p>
</li>
<li id="fn:time">
<p>It can take some time and effort to visualize and process all this
information. I find it helpful to draw some of these out with pencil and
paper, and also to assume without loss of generality that <script type="math/tex">S</script> corresponds
to the first “block” of <script type="math/tex">\theta</script>, and <script type="math/tex">S^c</script> therefore corresponds to the
second (and last) “block.” Please contact me if you spot typos; they’re
really easy to make here. <a href="#fnref:time" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 06 May 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/05/06/mathematicaltrickscommonlyusedinmachinelearningandstatistics
https://danieltakeshi.github.io/2017/05/06/mathematicaltrickscommonlyusedinmachinelearningandstatistics
Following Professor Michael I. Jordan's Advice: "Your Brain Needs Exercise"
<p>The lone class I am taking this semester is STAT 210B, the second course in the
PhD-level theoretical statistics sequence. I took STAT 210A last semester, and I
briefly <a href="https://danieltakeshi.github.io/2016/12/20/reviewoftheoreticalstatisticsstat210aatberkeley/">wrote about the class here</a>. I’ll have more to say about STAT 210B
in late May, but in this post I’d first like to present an interesting problem
that our professor, <a href="https://people.eecs.berkeley.edu/~jordan/">Michael I. Jordan</a>, brought up in lecture a few weeks
ago.</p>
<p>The problem Professor Jordan discussed was actually an old homework question,
but he said that it was so important for us to know this that he was going to
prove it in lecture anyway, <em>without</em> using any notes whatsoever. He also
stated:</p>
<blockquote>
<p>“Your brain needs exercise.”</p>
</blockquote>
<p>He then went ahead and successfully proved it, and urged us to do the same
thing.</p>
<p>OK, if he says to do that, then I will follow his advice and write out my answer
in this blog post. I’m probably the only student in class who’s going to be
doing this, but I’m already a bit unusual in having a longrunning blog. If any
of my classmates are reading this and have their own blogs, let me know!</p>
<p>By the way, for all the students out there who say that they don’t have time to
maintain personal blogs, why not take baby steps and start writing about stuff
that accomplishes your educational objectives, such as doing practice exercises?
It’s a nice way to make yourself look more productive than you actually are,
since you would be doing those anyway.</p>
<p>Anyway, here at last is the question Professor Jordan talked about:</p>
<blockquote>
<p>Let <script type="math/tex">\{X_i\}_{i=1}^n</script> be a sequence of zero-mean random variables, each
sub-Gaussian with parameter <script type="math/tex">\sigma</script> (no independence assumptions are
needed). Prove that</p>
<script type="math/tex; mode=display">\mathbb{E}\Big[\max_{i=1,\ldots,n}X_i\Big] \le \sqrt{2\sigma^2 \log n}</script>
<p>for all <script type="math/tex">n\ge 1</script>.</p>
</blockquote>
<p>This problem is certainly on the easier side of the homework questions we’ve
had, but it’s a good baseline and I’d like to showcase the solution here. Like
Professor Jordan, I will do this problem (a.k.a. write this blog post) without
any form of notes. Here goes: for <script type="math/tex">\lambda > 0</script>, we have
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
e^{\lambda \mathbb{E}[\max\{X_1, \ldots, X_n\}]} \;&{\overset{(i)}{\le}}\;\mathbb{E}[e^{\lambda \max\{X_1,\ldots,X_n\}}] \\
\;&{\overset{(ii)}{=}}\; \mathbb{E}[\max\{e^{\lambda X_1},\ldots,e^{\lambda X_n}\}] \\
\;&{\overset{(iii)}{\le}}\; \sum_{i=1}^n\mathbb{E}[e^{\lambda X_i}] \\
\;&{\overset{(iv)}{\le}}\; ne^{\frac{\lambda^2\sigma^2}{2}}
\end{align} %]]></script>
<p>where:</p>
<ul>
<li><strong>Step (i)</strong> follows from Jensen’s inequality. Yeah, that inequality is
<em>everywhere</em>.</li>
<li><strong>Step (ii)</strong> follows from noting that one can pull the maximum outside of the
exponential.</li>
<li><strong>Step (iii)</strong> follows from the classic union bound, which can be pretty bad
but we don’t have much else to go on here. The key fact is that the
exponential makes all terms in the sum positive.</li>
<li><strong>Step (iv)</strong> follows from applying the sub-Gaussian bound to all <script type="math/tex">n</script>
variables, and then summing them together.</li>
</ul>
<p>Next, taking logs and rearranging, we have</p>
<script type="math/tex; mode=display">\mathbb{E}\Big[\max\{X_1, \ldots, X_n\}\Big] \le \frac{\log n}{\lambda} + \frac{\lambda\sigma^2}{2}</script>
<p>Since <script type="math/tex">\lambda > 0</script> is isolated on the right hand side, we can
minimize over it to find the <em>tightest</em> upper bound. Doing so, we get
<script type="math/tex">\lambda^* = \frac{\sqrt{2 \log n}}{\sigma}</script>. Plugging this back in, we get</p>
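<p>To double-check the optimization step (a sketch of mine, with arbitrary values of <script type="math/tex">n</script> and <script type="math/tex">\sigma</script>): at <script type="math/tex">\lambda^*</script> the bound collapses to <script type="math/tex">\sqrt{2\sigma^2 \log n}</script>, and any other choice of <script type="math/tex">\lambda</script> gives a weaker bound, which is AM-GM in disguise.</p>

```python
import math

def bound(lam, n, sigma):
    """The upper bound log(n)/lambda + lambda * sigma^2 / 2, valid for lambda > 0."""
    return math.log(n) / lam + lam * sigma ** 2 / 2

n, sigma = 100, 1.5                              # arbitrary example values
lam_star = math.sqrt(2 * math.log(n)) / sigma    # the minimizer derived above
best = bound(lam_star, n, sigma)
target = math.sqrt(2 * sigma ** 2 * math.log(n))
```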
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}\Big[\max\{X_1, \ldots, X_n\}\Big] &\le \frac{\log n}{\lambda} + \frac{\lambda\sigma^2}{2} \\
&\le \frac{\sigma \log n}{\sqrt{2 \log n}} + \frac{\sigma^2\sqrt{2 \log n}}{2 \sigma} \\
&\le \frac{\sqrt{2 \sigma^2 \log n}}{2} + \frac{\sqrt{2 \sigma^2 \log n}}{2} \\
\end{align} %]]></script>
<p>which proves the desired claim.</p>
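<p>As a sanity check (my own, not part of the homework): i.i.d. <script type="math/tex">N(0,\sigma^2)</script> variables are <script type="math/tex">\sigma</script>-sub-Gaussian, so a Monte Carlo estimate of the expected maximum should sit below <script type="math/tex">\sqrt{2\sigma^2 \log n}</script>:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, trials = 100, 2.0, 20_000

# Each row is one draw of (X_1, ..., X_n); average the row-wise maxima.
samples = rng.normal(0.0, sigma, size=(trials, n))
emp_max = samples.max(axis=1).mean()
bound = np.sqrt(2 * sigma ** 2 * np.log(n))
```

The bound is not tight for moderate <script type="math/tex">n</script> (the empirical value is noticeably smaller), which matches the union bound in step (iii) being loose.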
<p>I have to reiterate that this problem is easier than the others we’ve done in
STAT 210B, and I’m sure that over 90 percent of the students in the class could
do this just as easily as I could. But this problem makes clear the <em>techniques</em>
that are often used in theoretical statistics nowadays, so at minimum students
should have a firm grasp of the content in this blog post.</p>
<p><strong>Update April 23, 2017</strong>: In an earlier version of this post, I made an error
with taking a maximum outside of an expectation. I have fixed this post. Thanks
to <a href="https://www.stat.berkeley.edu/~blfang/">Billy Fang</a> for letting me know about this.</p>
Sat, 22 Apr 2017 13:00:00 -0700
https://danieltakeshi.github.io/2017/04/22/followingprofessormichaeljordansadviceyourbrainneedsexercise
https://danieltakeshi.github.io/2017/04/22/followingprofessormichaeljordansadviceyourbrainneedsexercise
What I Wish People Would Say About Diversity
<p>The two mainstream newspapers that I read the most, <em>The New York Times</em> and
<em>The Wall Street Journal</em>, both have recent articles about diversity and the
tech industry, a topic which by now has considerable and welldeserved
attention.</p>
<p>The <a href="https://www.nytimes.com/2017/04/02/business/dealbook/facebookpushesoutsidelawfirmstobecomemorediverse.html?_r=0">New York Times article</a> starts out with:</p>
<blockquote>
<p>Like other Silicon Valley giants, Facebook has faced criticism over whether
its work force and board are too white and too male. Last year, the social
media behemoth started a new push on diversity in hiring and retention.</p>
</blockquote>
<blockquote>
<p>Now, it is extending its efforts into another corner: the outside lawyers who
represent the company in legal matters.</p>
</blockquote>
<blockquote>
<p>Facebook is requiring that women and ethnic minorities account for at least 33
percent of law firm teams working on its matters.</p>
</blockquote>
<p>The <a href="https://www.wsj.com/articles/googlepaysfemaleworkerslessthanmalecounterpartslabordepartmentsays1491622997">Wall Street Journal article</a> says:</p>
<blockquote>
<p>The tech industry has been under fire for years over the large percentage of
white and Asian male employees and executives. Tech firms have started
initiatives to try to combat the trend, but few have shown much progress.</p>
</blockquote>
<blockquote>
<p>The industry is now under scrutiny from the Labor Department for the issue.
The department sued software giant Oracle Corp. earlier this year for
allegedly paying white male workers more than other employees. Oracle said at
the time of the suit that the complaint was politically motivated, based on
false allegations, and without merit.</p>
</blockquote>
<p>These articles discuss important issues that need to be addressed in the tech
industry. However, I would also like to gently bring up some other points that I
think should be considered in tandem.</p>
<ul>
<li>
<p>The first is to clearly identify Asians (and multiracials<sup id="fnref:blog"><a href="#fn:blog" class="footnote">1</a></sup>) as either
belonging to a minority group or not. To its credit, the Wall Street Journal
article states this when including Asians among the “large percentage of
employees”, but I often see this fact elided in favor of just “white males.”
This is a broader issue which also arises when debating about affirmative
action. Out of curiosity, I opened up the Supreme Court’s opinions on <em>Fisher
v. University of Texas at Austin</em> (<a href="https://www.supremecourt.gov/opinions/15pdf/14981_4g15.pdf">PDF link</a>) and did a search for the
word “Asians”, which appears 66 times. Only four of those instances appear in
the majority opinion written by Justice Kennedy supporting raceconscious
admission; the other 62 occurrences of “Asians” are in Justice Alito’s
dissent.</p>
</li>
<li>
<p>The second is to suggest that there are people who have good reason to believe
that they would substantially contribute to workplace diversity, or who have
had to overcome considerable life challenges (which I argue also increases
work diversity), but who might otherwise not be considered a minority. For
instance, suppose a recent refugee from Syria with some computer programming
background applied to work at Google. If I were managing a hiring committee
and I knew of the applicant’s background information, I would be inspired and
would hold him to a slightly lower standard than other applicants, even if he
happened to be white and male. There are other possibilities, and one could
argue that poor whites or people who are disabled should qualify.</p>
</li>
<li>
<p>The third is to identify that there is a related problem in the tech industry
about the pool of qualified employees <em>to begin with</em>. If the qualified
applicants to tech jobs follow a certain distribution of the overall
population, then the most likely outcome is that the people who get hired
mirror that distribution. Thus, I would encourage emphasis on rephrasing the
argument as follows: “tech companies have been under scrutiny for having a
workforce which consists of too many white and Asian males <em>with respect to
the population distribution of qualified applicants</em>” (emphasis mine). The
words “qualified applicants” might be loaded, though. Tech companies often
filter students based on school because that is an easy and accurate way to
identify the top students, and in some schools (<a href="http://projects.dailycal.org/csgender/">such as the one I attend, for
instance</a>), the proportion of underrepresented minorities as traditionally
defined has remained stagnant for decades.</p>
</li>
</ul>
<p>I don’t want to sound insensitive to the need to make the tech workforce more
diverse. Indeed, that’s the <em>opposite</em> of what I feel, and I <em>think</em> (though I
can’t say for sure) that I would be more sensitive to the needs of
underrepresented minorities given my frequent experience of feeling like an
outcast among my classmates and colleagues.<sup id="fnref:offense"><a href="#fn:offense" class="footnote">2</a></sup> I just hope that my
alternative perspective is compatible with increasing diversity and can work
alongside — rather than against — the prevailing view.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:blog">
<p>See my <a href="https://danieltakeshi.github.io/2017/03/11/whatbiracialpeopleknow/">earlier blog post about this</a>. <a href="#fnref:blog" class="reversefootnote">↩</a></p>
</li>
<li id="fn:offense">
<p>I also take offense at the stereotype of the computer scientist as a
“shy, nerdy, antisocial male” and hope that it gets eradicated. I invite the
people espousing this stereotype to live in my shoes for a day. <a href="#fnref:offense" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 08 Apr 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/04/08/whatiwishpeoplewouldsayaboutdiversity/
https://danieltakeshi.github.io/2017/04/08/whatiwishpeoplewouldsayaboutdiversity/
Sir Tim Berners-Lee Wins the Turing Award
<p>The news is out that <a href="http://amturing.acm.org/award_winners/bernerslee_8087960.cfm">Sir Tim Berners-Lee has won the 2016 Turing Award</a>, the
highest honor in computer science. (Turing Award winners are usually announced a
few months <em>after</em> the actual year of the award.) He is best known for inventing
the World Wide Web, as clearly highlighted by the ACM’s citation:</p>
<blockquote>
<p>For inventing the World Wide Web, the first web browser, and the fundamental
protocols and algorithms allowing the Web to scale.</p>
</blockquote>
<p>(You can also find more information about some of his work <a href="https://www.w3.org/People/BernersLee/">on his personal
website</a>, where he has some helpful FAQs.)</p>
<p>My first reaction to reading the news was: <em>he didn’t already have a Turing
Award</em>?!? I actually thought he had been a cowinner with Vinton Cerf and Robert
Kahn, but nope. At least he’s won it now, so <a href="https://www.quora.com/WhyhasntTimBernersLeebeenawardedaTuringAwardyet">we won’t be asking Quora posts
like this one anymore</a>.</p>
<p>I’m rather surprised that this announcement wasn’t covered by many mainstream
newspapers. I tried searching for something in the New York Times, but nothing
showed up. This is rather a shame, because if we think of <em>inventing the World
Wide Web</em> as the “bar” for the Turing Award, then that’s a pretty high bar.</p>
<p>My prediction for the winner was actually Geoffrey Hinton, but I can’t argue
with Sir Tim Berners-Lee. (Thus, Hinton is going to be my prediction for the
2017 award.) Just like Terence Tao for the Fields Medal, Steven Weinberg for
the Nobel Prize in Physics, Merrick Garland for the Supreme Court, and so on,
they’re so utterly qualified that I can’t think of a reason to oppose them.</p>
Thu, 06 Apr 2017 00:00:00 -0700
https://danieltakeshi.github.io/2017/04/06/sirtimbernersleewinstheturingaward/
https://danieltakeshi.github.io/2017/04/06/sirtimbernersleewinstheturingaward/
Notes on the Generalized Advantage Estimation Paper
<p>This post serves as a continuation of <a href="https://danieltakeshi.github.io/2017/03/28/goingdeeperintoreinforcementlearningfundamentalsofpolicygradients/">my last post on the fundamentals of
policy gradients</a>. Here, I continue it by discussing the <em>Generalized
Advantage Estimation</em> (<a href="https://arxiv.org/abs/1506.02438">arXiv link</a>) paper from ICLR 2016, which presents and
analyzes more sophisticated forms of policy gradient methods.</p>
<p>Recall that raw policy gradients, while unbiased, have <em>high variance</em>. This
paper proposes ways to dramatically reduce variance, but this unfortunately
comes at the cost of introducing bias, so one needs to be careful before
applying tricks like this in practice.</p>
<p>The setting is the usual one which I presented in my last post, and we are
indeed trying to maximize the sum of rewards (assume no discount). I’m happy
that the paper includes a concise set of notes summarizing policy gradients:</p>
<p style="textalign:center;"> <img src="https://danieltakeshi.github.io/assets/gae_paper_pg_basics.png" alt="policy_gradients" /> </p>
<p>If the above is not 100% clear to you, I recommend reviewing the basics of policy
gradients. I covered five of the six forms of the <script type="math/tex">\Psi_t</script> function in my last
post; the exception is the temporal difference residual, but I will go over
that one later in this post.</p>
<p>Somewhat annoyingly, they use the infinite-horizon setting. I find it easier to
think about the <em>finite</em> horizon case, and I will clarify if I’m assuming that.</p>
<h1 id="proposition1gammajustestimators">Proposition 1: <script type="math/tex">\gamma</script>Just Estimators.</h1>
<p>One of the first things they prove is Proposition 1, regarding “<script type="math/tex">\gamma</script>-just”
advantage estimators. (The word “just” seems like an odd choice here, but I’m
not complaining.) Suppose <script type="math/tex">\hat{A}_t(s_{0:\infty},a_{0:\infty})</script> is an
estimate of the advantage function. A <script type="math/tex">\gamma</script>-just estimator (of the
advantage function) results in</p>
<script type="math/tex; mode=display">\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[\hat{A}_t(s_{0:\infty},a_{0:\infty}) \nabla_\theta \log \pi_{\theta}(a_ts_t)\right]=
\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[A^{\pi,\gamma}(s_{0:\infty},a_{0:\infty}) \nabla_\theta \log \pi_{\theta}(a_ts_t)\right]</script>
<p>This is for <em>one time step</em> <script type="math/tex">t</script>. If we sum over all time steps, by linearity
of expectation we get</p>
<script type="math/tex; mode=display">\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[\sum_{t=0}^\infty \hat{A}_t(s_{0:\infty},a_{0:\infty}) \nabla_\theta \log \pi_{\theta}(a_ts_t)\right]=
\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[\sum_{t=0}^\infty A^{\pi,\gamma}(s_t,a_t)\nabla_\theta \log \pi_{\theta}(a_ts_t)\right]</script>
<p>In other words, we get an <em>unbiased</em> estimate of the discounted gradient. Note,
however, that this discounted gradient is <em>different</em> from the gradient of the
actual function we’re trying to optimize, since that was for the <em>undiscounted</em>
rewards. The authors emphasize this in a footnote, saying that they’ve <em>already</em>
introduced bias by even assuming the use of a discount factor. (I’m somewhat
pleased at myself for catching this in advance.)</p>
<p>The proof for Proposition 1 is based on proving it for one time step <script type="math/tex">t</script>,
which is all that is needed. The resulting term with <script type="math/tex">\hat{A}_t</script> in it splits
into two terms due to linearity of expectation, one with the <script type="math/tex">Q_t</script> function
and another with the baseline. The second term is zero due to the baseline
causing the expectation to be zero, which I derived in my previous post in the
finite-horizon case. (I’m not totally sure how to do this in the
infinite-horizon case, due to technicalities involving infinity.)</p>
<p>The first term is unfortunately a little more complicated. Let me use the finite
horizon <script type="math/tex">T</script> for simplicity so that I can easily write out the definition. They
argue in the proof that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\mathbb{E}_{s_{0:T},a_{0:T}}\left[ \nabla_\theta \log \pi_{\theta}(a_ts_t) \cdot Q_t(s_{0:T},a_{0:T})\right] \\
&= \mathbb{E}_{s_{0:t},a_{0:t}}\left[ \nabla_\theta \log \pi_{\theta}(a_ts_t)\cdot \mathbb{E}_{s_{t+1:T},a_{t+1:T}}\Big[Q_t(s_{0:T},a_{0:T})\Big]\right] \\
&= \int_{s_0}\cdots \int_{s_t}\int_{a_t}\Bigg[ p_\theta((s_0,\ldots,s_t,a_t)) \nabla_\theta \log \pi_{\theta}(a_ts_t) \cdot \mathbb{E}_{s_{t+1:T},a_{t+1:T}}\Big[ Q_t(s_{0:T},a_{0:T}) \Big]\Bigg] d\mu(s_0,\ldots,s_t,a_t)\\
\;&{\overset{(i)}{=}}\; \int_{s_0}\cdots \int_{s_t} \left[ p_\theta((s_0,\ldots,s_t)) \nabla_\theta \log \pi_{\theta}(a_ts_t) \cdot A^{\pi,\gamma}(s_t,a_t)\right] d\mu(s_0,\ldots,s_t)
\end{align} %]]></script>
<p>Most of this proceeds by definitions of expectations and then “pushing”
integrals into their appropriate locations. Unfortunately, I am unable to
figure out how they did step (i). Specifically, I don’t see how the integral
over <script type="math/tex">a_t</script> somehow “moves past” the <script type="math/tex">\nabla_\theta \log \pi_\theta(a_ts_t)</script>
term. Perhaps there is some trickery with the law of iterated expectation due to
conditionals? If anyone else knows why and is willing to explain with detailed
math somewhere, I would really appreciate it.</p>
<p>For now, I will assume this proposition to be true. It is useful because if we
are given the form of estimator <script type="math/tex">\hat{A}_t</script> of the advantage, we can
immediately tell if it is an unbiased advantage estimator.</p>
<h1 id="advantagefunctionestimators">Advantage Function Estimators</h1>
<p>Now assume we have some function <script type="math/tex">V</script> which attempts to approximate the true
value function <script type="math/tex">V^\pi</script> (or <script type="math/tex">V^{\pi,\gamma}</script> in the undiscounted setting).</p>
<ul>
<li>
<p><strong>Note I</strong>: <script type="math/tex">V</script> is <em>not</em> the true value function. It is only our estimate of
it, so <script type="math/tex">V_\phi(s_t) \approx V^\pi(s_t)</script>. I added in the <script type="math/tex">\phi</script> subscript
to indicate that we use a function, such as a neural network, to approximate
the value. The weights of the neural network are entirely specified by
<script type="math/tex">\phi</script>.</p>
</li>
<li>
<p><strong>Note II</strong>: we <em>also</em> have our policy <script type="math/tex">\pi_\theta</script> parameterized by
parameters <script type="math/tex">\theta</script>, again typically a neural network. For now, assume
that <script type="math/tex">\phi</script> and <script type="math/tex">\theta</script> are <em>separate</em> parameters; the authors mention
some enticing future work where one can <em>share</em> parameters and jointly
optimize. The combination of <script type="math/tex">\pi_{\theta}</script> and <script type="math/tex">V_{\phi}</script> with a policy
estimator and a value function estimator is known as the <strong>actor-critic</strong>
model with the policy as the actor and the value function as the critic. (I
don’t know why it’s called a “critic” because the value function acts more
like an “assistant”.)</p>
</li>
</ul>
<p>Using <script type="math/tex">V</script>, we can derive a <em>class</em> of advantage function estimators as
follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{A}_t^{(1)} &= r_t + \gamma V(s_{t+1})  V(s_t) \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})  V(s_t) \\
\cdots &= \cdots \\
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots  V(s_t)
\end{align} %]]></script>
<p>These take on the form of temporal difference estimators where we first estimate
the sum of discounted rewards and then we subtract the value function estimate
of it. <strong>If</strong> <script type="math/tex">V = V^{\pi,\gamma}</script>, meaning that <script type="math/tex">V</script> is exact, then all of
the above are unbiased estimates for the advantage function. In practice, this
will not be the case, since we are not given the value function.</p>
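<p>To make these estimators concrete, here is a minimal sketch (mine, not the paper’s) of computing <script type="math/tex">\hat{A}_t^{(k)}</script> from one sampled episode. It assumes my own conventions: <code>rewards</code> has length <script type="math/tex">T</script> and <code>values</code> has length <script type="math/tex">T+1</script>, with the terminal value set to zero.</p>

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1}
                 + gamma^k * V(s_{t+k}) - V(s_t).
    `rewards` has length T; `values` has length T+1, with the convention
    that V(s_T) = 0 at a terminal state."""
    T = len(rewards)
    k = min(k, T - t)                        # truncate at the episode's end
    ret = sum(gamma**l * rewards[t + l] for l in range(k))
    ret += gamma**k * values[t + k]          # bootstrap with the value estimate
    return ret - values[t]
```

<p>With <script type="math/tex">k=1</script> this is exactly a one-step temporal difference term, and as <script type="math/tex">k</script> grows it approaches the Monte Carlo return minus <script type="math/tex">V(s_t)</script>.</p>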
<p>The <em>tradeoff</em> here is that the estimators <script type="math/tex">\hat{A}_t^{(k)}</script> with small <script type="math/tex">k</script>
have <strong>low variance but high bias</strong>, whereas those with large <script type="math/tex">k</script> have <strong>low
bias but high variance</strong>. Why? I think of it based on the number of terms. With
small <script type="math/tex">k</script>, we have fewer terms to sum over (which means low variance).
However, the bias is relatively large because it does not make use of extra
“exact” information with <script type="math/tex">r_K</script> for <script type="math/tex">K > k</script>. Here’s another way to think of
it as emphasized in the paper: <script type="math/tex">V(s_t)</script> is constant among the estimator class,
so it does not affect the relative bias or variance among the estimators:
differences arise entirely due to the <script type="math/tex">k</script>-step returns.</p>
<p>One might wonder, as I originally did, how to make use of the <script type="math/tex">k</script>-step returns
in practice. In Q-learning, we have to update the parameters (or the <script type="math/tex">Q(s,a)</script>
“table”) after each current reward, right? The key is to let the agent run for
<script type="math/tex">k</script> steps, and <em>then</em> update the parameters based on the returns. The reason
why we update parameters “immediately” in ordinary Q-learning is simply due to
the <em>definition</em> of Q-learning. With longer returns, we have to keep the
Q-values fixed until the agent has explored more. This is also emphasized in the
A3C paper from DeepMind, where they talk about <script type="math/tex">n</script>-step Q-learning.</p>
<h1 id="thegeneralizedadvantageestimator">The Generalized Advantage Estimator</h1>
<p>It might not be so clear which of these estimators above is the most useful. How
can we compute the bias and variance?</p>
<p>It turns out that it’s better to use <em>all</em> of the estimators, in a clever way.
First, define the temporal difference residual <script type="math/tex">\delta_t^V = r_t + \gamma
V(s_{t+1}) - V(s_t)</script>. Now, here’s how the <strong>Generalized Advantage Estimator</strong>
<script type="math/tex">\hat{A}_t^{GAE(\gamma,\lambda)}</script> is defined:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1\lambda)\Big(\hat{A}_{t}^{(1)} + \lambda \hat{A}_{t}^{(2)} + \lambda^2 \hat{A}_{t}^{(3)} + \cdots \Big) \\
&= (1\lambda)\Big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V)+ \cdots \Big) \\
&= (1\lambda)\Big( \delta_t^V(1+\lambda+\lambda^2+\cdots) + \gamma\delta_{t+1}^V(\lambda+\lambda^2+\cdots) + \cdots \Big) \\
&= (1\lambda)\left(\delta_t^V \frac{1}{1\lambda} + \gamma \delta_{t+1}^V\frac{\lambda}{1\lambda} + \cdots\right) \\
&= \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^{V}
\end{align} %]]></script>
<p>To derive this, one simply expands the definitions and uses the geometric series
formula. The result is interesting to interpret: <em>the exponentially-decayed sum
of residual terms</em>.</p>
<p>The above describes the estimator <script type="math/tex">GAE(\gamma, \lambda)</script> for <script type="math/tex">\lambda \in
[0,1]</script> where adjusting <script type="math/tex">\lambda</script> adjusts the bias-variance tradeoff. We
usually have <script type="math/tex">{\rm Var}(GAE(\gamma, 1)) > {\rm Var}(GAE(\gamma, 0))</script> due to
the number of terms in the summation (more terms usually means higher variance),
but the bias relationship is reversed. The other parameter, <script type="math/tex">\gamma</script>, <em>also</em>
adjusts the bias-variance tradeoff … but for the GAE analysis it seems like
the <script type="math/tex">\lambda</script> part is more important. Admittedly, it’s a bit confusing why we
need to have both <script type="math/tex">\gamma</script> and <script type="math/tex">\lambda</script> (after all, we can absorb them into
one constant, right?) but as you can see, the constants serve different roles in
the GAE formula.</p>
<p>To make a long story short, we can put the GAE in the policy gradient estimate
and we’ve got our biased estimate (unless <script type="math/tex">\lambda=1</script>) of the discounted
gradient, which again, is <em>itself</em> biased due to the discount. Will this work
well in practice? Stay tuned …</p>
<h1 id="rewardshapinginterpretation">Reward Shaping Interpretation</h1>
<p><strong>Reward shaping</strong> originated from a 1999 ICML paper, and refers to the
technique of transforming the original reward function <script type="math/tex">r</script> into a new one
<script type="math/tex">\tilde{r}</script> via the following transformation with <script type="math/tex">\Phi: \mathcal{S} \to
\mathbb{R}</script> an arbitrary realvalued function on the state space:</p>
<script type="math/tex; mode=display">\tilde{r}(s,a,s') = r(s,a,s') + \gamma \Phi(s')  \Phi(s)</script>
<p>Amazingly, it was shown that even though <script type="math/tex">\Phi</script> is arbitrary, the reward
shaping transformation results in the <em>same optimal policy and optimal policy
gradient</em>, at least when the objective is to maximize discounted rewards
<script type="math/tex">\sum_{t=0}^\infty \gamma^t r(s_t,a_t,s_{t+1})</script>. I am not sure whether the
same is true with the undiscounted case as they have here, but it seems like it
should since we can set <script type="math/tex">\gamma=1</script>.</p>
<p>The more important benefit for their purposes, it seems, is that this reward
shaping leaves the advantage function invariant for any policy. The word
“invariant” here means that if we computed the advantage function
<script type="math/tex">A^{\pi,\gamma}</script> for a policy and a discount factor in some MDP, the
<em>transformed</em> MDP would have some advantage function <script type="math/tex">\tilde{A}^{\pi,\gamma}</script>,
but we would have <script type="math/tex">A^{\pi,\gamma} = \tilde{A}^{\pi,\gamma}</script> (nice!). This
follows because if we consider the discounted sum of rewards starting at state
<script type="math/tex">s_t</script> in the <em>transformed</em> MDP, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\sum_{l=0}^{\infty} \gamma^l \tilde{r}(s_{t+l},a_{t+l},s_{t+l+1}) &= \left[\sum_{l=0}^{\infty}\gamma^l r(s_{t+l},a_{t+l},s_{t+l+1})\right] + \Big( \gamma\Phi(s_{t+1})  \Phi(s_t) + \gamma^2\Phi(s_{t+2})\gamma \Phi(s_{t+1})+ \cdots\Big)\\
&= \sum_{l=0}^{\infty}\gamma^l r(s_{t+l},a_{t+l},s_{t+l+1})  \Phi(s_t)
\end{align} %]]></script>
<p>“Hitting” the above values with expectations (as Michael I. Jordan would say it)
and substituting appropriate values results in the desired
<script type="math/tex">\tilde{A}^{\pi,\gamma}(s_t,a_t) = A^{\pi,\gamma}(s_t,a_t)</script> equality.</p>
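<p>This telescoping is easy to verify numerically. In the following sketch (my own, with made-up rewards and <script type="math/tex">\Phi</script> values), the horizon is finite and <script type="math/tex">\Phi</script> is zero at the terminal state, so the shaped return should equal the original return minus <script type="math/tex">\Phi(s_0)</script>:</p>

```python
import random

random.seed(0)
gamma, T = 0.9, 6
rewards = [random.uniform(-1, 1) for _ in range(T)]
# Arbitrary potential function, with Phi = 0 at the terminal state.
phi = [random.uniform(-1, 1) for _ in range(T)] + [0.0]

# Shaped rewards: r~(s, a, s') = r(s, a, s') + gamma*Phi(s') - Phi(s).
shaped = [rewards[t] + gamma * phi[t + 1] - phi[t] for t in range(T)]

shaped_return = sum(gamma**t * shaped[t] for t in range(T))
original_return = sum(gamma**t * rewards[t] for t in range(T))
# Telescoping leaves only -Phi(s_0):
difference = shaped_return - original_return
```

<p>Here <code>difference</code> comes out to exactly <script type="math/tex">-\Phi(s_0)</script> (up to floating point error), no matter what values <script type="math/tex">\Phi</script> takes.</p>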
<p>The connection between reward shaping and the GAE is the following: suppose we
are trying to find a good policy gradient estimate for the transformed MDP. If
we try to maximize the sum of <script type="math/tex">(\gamma \lambda)</script>discounted sum of
(transformed) rewards and set <script type="math/tex">\Phi = V</script>, we get precisely the GAE! With <script type="math/tex">V</script>
here, we have <script type="math/tex">\tilde{r}(s_t,a_t,s_{t+1}) = \delta_t^V</script>, the residual term
defined earlier.</p>
<p>To analyze the tradeoffs with <script type="math/tex">\gamma</script> and <script type="math/tex">\lambda</script>, they use a <em>response
function</em>:</p>
<script type="math/tex; mode=display">\chi(l; s_t,a_t) := \mathbb{E}[r_{l+t} \mid s_t,a_t]  \mathbb{E}[r_{l+t} \mid s_t]</script>
<p>Why is this important? They state it clearly:</p>
<blockquote>
<p>The response function lets us quantify the temporal credit assignment problem:
long-range dependencies between actions and rewards correspond to nonzero
values of the response function for <script type="math/tex">l \gg 0</script>.</p>
</blockquote>
<p>These “long-range dependencies” are the most challenging part of the credit
assignment problem. Then here’s the kicker: they argue that if <script type="math/tex">\Phi =
V^{\pi,\gamma}</script>, then the transformed rewards are such that
<script type="math/tex">\mathbb{E}[\tilde{r}_{l+t} \mid s_t,a_t] - \mathbb{E}[\tilde{r}_{l+t} \mid
s_t] = 0</script> for <script type="math/tex">l>0</script>. Thus, long-range rewards have to induce an immediate
response! I’m admittedly not totally sure if I understand this, and it seems odd
that we only want the response function to be nonzero at the current time (I
mean, some rewards <em>have</em> to be <em>merely</em> a few steps in the future, right?). I
will take another look at this section if I have time.</p>
<h1 id="valuefunctionestimation">Value Function Estimation</h1>
<p>In order to be able to <em>use</em> the GAE in our policy gradient algorithm (again,
this means computing gradients and shifting the weights of the policy to
maximize an objective), we need some value function <script type="math/tex">V_\phi</script> parameterized by
a neural network. This is part of the <strong>actor-critic</strong> framework, where the
“critic” provides the value function estimate.</p>
<p>Let <script type="math/tex">\hat{V}_t = \sum_{l=0}^\infty \gamma^l r_{t+l}</script> be the discounted sum of
rewards. The authors propose the following optimization procedure to find the
best weights <script type="math/tex">\phi</script>:</p>
<script type="math/tex; mode=display">{\rm minimize}_\phi \quad \sum_{n=1}^N\V_\phi(s_n)  \hat{V}_n\_2^2</script>
<script type="math/tex; mode=display">\mbox{subject to} \quad \frac{1}{N}\sum_{n=1}^N\frac{\V_\phi(s_n) 
\hat{V}_{\phi_{\rm old}}(s_n)\_2^2}{2\sigma^2} \le \epsilon</script>
<p>where each iteration, <script type="math/tex">\phi_{\rm old}</script> is the parameter vector before the
update, and</p>
<script type="math/tex; mode=display">\sigma^2 = \frac{1}{N}\sum_{n=1}^N\V_{\phi_{\rm old}}(s_n)\hat{V}_n\_2^2</script>
<p>This is a <em>constrained optimization</em> problem to find the best weights for the
value function. The constraint reminds me of Trust Region Policy Optimization,
because it limits the amount that <script type="math/tex">\phi</script> can change from one update to
another. The advantages with a “trust region” method are that the weights don’t
change too much and that they don’t overfit to the current batch. (Updates are
done in <em>batch</em> mode, which is standard nowadays.)</p>
<ul>
<li>
<p><strong>Note I</strong>: unfortunately, the authors don’t use this optimization procedure
exactly. They use a <em>conjugate gradient</em> method to approximate it. But think
of the optimization procedure here since it’s easier to understand and is
“ideal.”</p>
</li>
<li>
<p><strong>Note II</strong>: remember that this is <em>not</em> the update to the policy
<script type="math/tex">\pi_\theta</script>. That update requires an entirely separate optimization
procedure. Don’t get confused between the two. Both the policy and the value
functions can be implemented as neural networks, and in fact, that’s what the
authors do. They actually have the same architecture, with the exception of
the output layer since the value only needs a scalar, whereas the policy needs
a higherdimensional output vector.</p>
</li>
</ul>
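<p>Setting the trust-region constraint aside, the objective itself is just a regression of <script type="math/tex">V_\phi</script> onto the empirical returns. As a much simpler stand-in for the authors’ conjugate gradient procedure, here is a sketch (mine, on entirely made-up data) that fits a <em>linear</em> value function by plain gradient descent on the squared error:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: state feature vectors and noisy "empirical returns" V-hat_n.
states = rng.normal(size=(64, 4))
true_weights = np.array([1.0, -2.0, 0.5, 0.0])
returns = states @ true_weights + 0.1 * rng.normal(size=64)

phi = np.zeros(4)                      # linear value function V_phi(s) = s . phi
for _ in range(500):                   # plain gradient descent on the squared
    residual = states @ phi - returns  # error; no trust-region constraint here,
    phi -= 0.05 * (states.T @ residual) / len(states)  # unlike the paper

mse = float(np.mean((states @ phi - returns) ** 2))
```

<p>After training, <code>phi</code> recovers the made-up weights up to the noise level. The trust-region constraint would additionally cap how far <code>phi</code> moves per batch, which matters once the returns themselves come from a changing policy.</p>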
<h1 id="puttingitalltogether">Putting it All Together</h1>
<p>It’s nice to understand each of the components above, but how do we combine them
into an <em>actual algorithm</em>? Here’s a rough description of their proposed
actor-critic algorithm, each iteration:</p>
<ul>
<li>
<p>Simulate the current policy to collect data.</p>
</li>
<li>
<p>Compute the Bellman residuals <script type="math/tex">\delta_{t}^V</script>.</p>
</li>
<li>
<p>Compute the advantage function estimate <script type="math/tex">\hat{A}_t</script>.</p>
</li>
<li>
<p>Update the policy’s weights, <script type="math/tex">\theta_{i+1}</script>, with a TRPO update.</p>
</li>
<li>
<p>Update the critic’s weights, <script type="math/tex">\phi_{i+1}</script>, with a trustregion update.</p>
</li>
</ul>
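<p>As a toy sketch of that loop (entirely my own, with the TRPO and trust-region updates replaced by plain gradient steps, on a made-up one-step problem with two actions where action 1 pays reward 1 and action 0 pays nothing):</p>

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]   # softmax policy logits (the "actor")
v = 0.0              # scalar value baseline (the "critic")

def policy_probs():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

for _ in range(200):
    probs = policy_probs()
    # 1. Simulate the current policy to collect data.
    acts = [0 if random.random() < probs[0] else 1 for _ in range(100)]
    rews = [float(a) for a in acts]          # action 1 -> reward 1
    # 2./3. With one-step episodes the residuals r - V(s) ARE the advantages.
    advs = [r - v for r in rews]
    # 4. Policy update: plain gradient ascent standing in for the TRPO step.
    for a, adv in zip(acts, advs):
        for i in range(2):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += 0.5 * adv * grad_log_pi / len(acts)
    # 5. Critic update: move the baseline toward the mean observed return.
    v += 0.5 * (sum(rews) / len(rews) - v)

final_probs = policy_probs()
```

<p>After a couple hundred iterations the policy puts nearly all its probability on the rewarding action, while the baseline tracks the expected return; the real algorithm differs mainly in using neural networks and trust-region updates for both steps.</p>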
<p>As usual, here are a few of my overly detailed comments (sorry again):</p>
<ul>
<li>
<p><strong>Note I</strong>: Yes, there are trust region methods for <em>both</em> the value function
update and the policy function update. This is one of their contributions.
(To be clear, the notion of a “GAE” isn’t entirely their contribution.) The
value and policy are also both neural networks with the same architecture
except for the output since they have different outputs. Honestly, it seems
like we should <em>always</em> be thinking about trust region methods whenever we
have some optimization to do.</p>
</li>
<li>
<p><strong>Note II</strong>: If you’re confused by the role of the two networks, repeat this
to yourself: the policy network is for determining actions, and the value
network is for improving the performance of the gradient update (which is used
to improve the actual policy by pointing the gradient in the correct
direction!).</p>
</li>
</ul>
<p>They present some impressive experimental benchmarks using this actor-critic
algorithm. I don’t have too much experience with MuJoCo so I can’t intuitively
think about the results that much. (I’m also surprised that MuJoCo isn’t free
and requires payment; it must be by far the best physics simulator for
reinforcement learning, otherwise people wouldn’t be using it.)</p>
<h1 id="concludingthoughts">Concluding Thoughts</h1>
<p>I didn’t understand the implications of this paper when I read it for the first
time (maybe more than a year ago!) but it’s becoming clearer now. They present
and analyze a specific kind of estimator, the GAE, which has a bias-variance
“knob” with the <script type="math/tex">\lambda</script> (and <script type="math/tex">\gamma</script>, technically). By adjusting the
knob, it might be possible to get low-variance, low-bias estimates, which
would drastically improve the sample efficiency of policy gradient methods. They
also present a way to estimate the value function using a trust region method.
With these components, they are able to achieve high performance on challenging
reinforcement learning tasks with continuous control.</p>
Sat, 01 Apr 2017 23:00:00 -0700
https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/
Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients<p>As I stated <a href="https://danieltakeshi.github.io/2017/03/23/keeping-track-of-research-articles-my-paper-notes-repository/">in my last blog post</a>, I am feverishly trying to read more
research papers. One category of papers that seems to be coming up a lot
recently are those about <em>policy gradients</em>, which are a popular class of
reinforcement learning algorithms that estimate a gradient for a function
approximator. Thus, the purpose of this blog post is for me to explicitly write
the mathematical foundations for policy gradients so that I can gain
understanding. In turn, I hope some of my explanations will be useful to a
broader audience of AI students.</p>
<h1 id="assumptionsandproblemstatement">Assumptions and Problem Statement</h1>
<p>In any type of research domain, we always have to make some set of assumptions.
(By “we”, I refer to the researchers who write papers on this.) With
reinforcement learning and policy gradients, the assumptions usually mean the
<strong>episodic</strong> setting where an agent engages in multiple <strong>trajectories</strong> in its
environment. As an example, an agent could be playing a game of Pong, so one
episode or trajectory consists of a full start-to-finish game.</p>
<p>We define a trajectory <script type="math/tex">\tau</script> of length <script type="math/tex">T</script> as</p>
<script type="math/tex; mode=display">\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T1}, a_{T1}, r_{T1}, s_T)</script>
<p>where <script type="math/tex">s_0</script> comes from the starting distribution of states, <script type="math/tex">a_i \sim
\pi_\theta(a_i s_i)</script>, and <script type="math/tex">s_i \sim P(s_i  s_{i1},a_{i1})</script> with <script type="math/tex">P</script> the
dynamics model (i.e. how the environment changes). We actually <em>ignore</em> the
dynamics when optimizing, since all we care about is getting a good gradient
signal for <script type="math/tex">\pi_\theta</script> to make it better. If this isn’t clear now, it will be
clear soon. Also, the reward can be computed from the states and actions, since
it’s usually a function of <script type="math/tex">(s_i,a_i,s_{i+1})</script>, so it’s not technically needed
in the trajectory.</p>
<p>What’s our <em>goal</em> here with policy gradients? Unlike algorithms such as DQN,
which strive to find an excellent policy indirectly through Qvalues, policy
gradients perform a <em>direct</em> gradient update on a policy to change its
parameters, which is what makes it so appealing. Formally, we have:</p>
<script type="math/tex; mode=display">{\rm maximize}_{\theta}\; \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T1}\gamma^t r_t\right]</script>
<ul>
<li>
<p><strong>Note I</strong>: I put <script type="math/tex">\pi_{\theta}</script> under the expectation. This means the
rewards are computed from a trajectory which was generated under the policy
<script type="math/tex">\pi_\theta</script>. We have to <em>find</em> “optimal” settings of <script type="math/tex">\theta</script> to make
this work.</p>
</li>
<li>
<p><strong>Note II</strong>: we don’t need to optimize the expected sum of discounted rewards,
though it’s the formulation I’m most used to. Alternatives include ignoring
<script type="math/tex">\gamma</script> by setting it to one, extending <script type="math/tex">T</script> to infinity if the episodes
are infinite-horizon, and so on.</p>
</li>
</ul>
<p>The above raises the allimportant question: <em>how do we find the best
<script type="math/tex">\theta</script></em>? If you’ve taken optimization classes before, you should know the
answer already: perform gradient ascent on <script type="math/tex">\theta</script>, so we have <script type="math/tex">\theta
\leftarrow \theta + \alpha \nabla f(x)</script> where <script type="math/tex">f(x)</script> is the function being
optimized. Here, that’s the expected value of whatever sum of rewards formula
we’re using.</p>
<h1 id="twostepslogderivativetrickanddetermininglogprobability">Two Steps: LogDerivative Trick and Determining Log Probability</h1>
<p>Before getting to the computation of the gradient, let’s first review two
mathematical facts which will be used later, and which are also of independent interest.
The first is the “log-derivative” trick, which tells us how to insert a log into
an expectation when starting from <script type="math/tex">\nabla_\theta \mathbb{E}[f(x)]</script>.
Specifically, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}[f(x)] &= \nabla_\theta \int p_\theta(x)f(x)dx \\
&= \int \frac{p_\theta(x)}{p_\theta(x)} \nabla_\theta p_\theta(x)f(x)dx \\
&= \int p_\theta(x)\nabla_\theta \log p_\theta(x)f(x)dx \\
&= \mathbb{E}\Big[f(x)\nabla_\theta \log p_\theta(x)\Big]
\end{align} %]]></script>
<p>where <script type="math/tex">p_\theta</script> is the density of <script type="math/tex">x</script>. Most of these steps should be
straightforward. The main technical detail to worry about is exchanging the
gradient with the integral. I have never been comfortable in knowing when we are
allowed to do this or not, but since everyone else does this, I will follow
them.</p>
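<p>To convince myself the trick works, here is a quick numerical check on a made-up one-dimensional example: take <script type="math/tex">x \sim \mathcal{N}(\theta, 1)</script> and <script type="math/tex">f(x) = x^2</script>. Then <script type="math/tex">\mathbb{E}[f(x)] = \theta^2 + 1</script>, so the true gradient is <script type="math/tex">2\theta</script>, and the score is <script type="math/tex">\nabla_\theta \log p_\theta(x) = x - \theta</script>:</p>

```python
import random

random.seed(0)
theta, n = 1.5, 200_000

# Monte Carlo estimate of grad_theta E[f(x)] = E[ f(x) * grad_theta log p(x) ]
# for x ~ N(theta, 1) and f(x) = x^2, where the score is (x - theta).
est = 0.0
for _ in range(n):
    x = random.gauss(theta, 1.0)
    est += (x ** 2) * (x - theta)
est /= n
# Analytically, E[x^2] = theta^2 + 1, so the true gradient is 2 * theta = 3.
```

<p>The estimate lands close to 3, though notice how many samples it takes to get a tight answer; this estimator’s high variance is exactly the issue the rest of the post is about.</p>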
<p>Another technical detail we will need is the gradient of the log probability of
a <em>trajectory</em> since we will later switch <script type="math/tex">x</script> from above with a trajectory
<script type="math/tex">\tau</script>. The computation of <script type="math/tex">\log p_\theta(\tau)</script> proceeds as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \log p_\theta(\tau) &= \nabla \log \left(\mu(s_0) \prod_{t=0}^{T1} \pi_\theta(a_ts_t)P(s_{t+1}s_t,a_t)\right) \\
&= \nabla_\theta \left[\log \mu(s_0)+ \sum_{t=0}^{T1} (\log \pi_\theta(a_ts_t) + \log P(s_{t+1}s_t,a_t)) \right]\\
&= \nabla_\theta \sum_{t=0}^{T1}\log \pi_\theta(a_ts_t)
\end{align} %]]></script>
<p>The probability of <script type="math/tex">\tau</script> decomposes into a chain of probabilities by the
Markov Decision Process assumption, whereby the next action only depends on the
current state, and the next state only depends on the current state and action.
To be explicit, we use the functions that we already defined: <script type="math/tex">\pi_\theta</script> and
<script type="math/tex">P</script> for the policy and dynamics, respectively. (Here, <script type="math/tex">\mu</script> represents the
starting state distribution.) We also observe that when taking gradients, the
dynamics disappear!</p>
<h1 id="computingtherawgradient">Computing the Raw Gradient</h1>
<p>Using the two tools above, we can now get back to our original goal, which was
to compute the gradient of the expected sum of (discounted) rewards. Formally,
let <script type="math/tex">R(\tau)</script> be the reward function we want to optimize (i.e. maximize).
Using the above two tricks, we obtain:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim
\pi_\theta} \left[R(\tau) \cdot \nabla_\theta \left(\sum_{t=0}^{T1}\log
\pi_\theta(a_ts_t)\right)\right]</script>
<p>In the above, the expectation is with respect to the policy function, so think
of it as <script type="math/tex">\tau \sim \pi_\theta</script>. In practice, we need trajectories to get an
empirical expectation, which estimates this actual expectation.</p>
<p>So that’s the gradient! Unfortunately, we’re not quite done yet. The naive way
is to run the agent on a batch of episodes, get a set of trajectories (call it
<script type="math/tex">\hat{\tau}</script>) and update with <script type="math/tex">\theta \leftarrow \theta + \alpha
\nabla_\theta \mathbb{E}_{\tau \in \hat{\tau}}[R(\tau)]</script> using the empirical
expectation, but this will be too slow and unreliable due to high variance on
the gradient estimates. After one batch, we may exhibit a wide range of results:
much better performance, equal performance, or <em>worse</em> performance. The high
variance of these gradient estimates is precisely why there has been so much
effort devoted to variance reduction techniques. (I should also add from
personal research experience that variance reduction is certainly not limited to
reinforcement learning; it also appears in many statistical projects which
concern a bias-variance tradeoff.)</p>
<h1 id="howtointroduceabaseline">How to Introduce a Baseline</h1>
<p>The standard way to reduce the variance of the above gradient estimates is to
insert a <strong>baseline function</strong> <script type="math/tex">b(s_t)</script> inside the expectation.</p>
<p>For concreteness, assume <script type="math/tex">R(\tau) = \sum_{t=0}^{T-1}r_t</script>, so we have no
discounted rewards. We can express the policy gradient in three equivalent, but
perhaps nonintuitive ways:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau)\Big] \;&{\overset{(i)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[\left(\sum_{t=0}^{T1}r_t\right) \cdot \nabla_\theta \left(\sum_{t=0}^{T1}\log \pi_\theta(a_ts_t)\right)\right] \\
&{\overset{(ii)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t'=0}^{T1} r_{t'} \sum_{t=0}^{t'}\nabla_\theta \log \pi_\theta(a_ts_t)\right] \\
&{\overset{(iii)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T1} \nabla_\theta \log \pi_\theta(a_ts_t) \left(\sum_{t'=t}^{T1}r_{t'}\right) \right]
\end{align} %]]></script>
<p>Comments:</p>
<ul>
<li>
<p><strong>Step (i)</strong> follows from plugging in our chosen <script type="math/tex">R(\tau)</script> into the policy
gradient we previously derived.</p>
</li>
<li>
<p><strong>Step (ii)</strong> follows from first noting that <script type="math/tex">\nabla_\theta
\mathbb{E}_{\tau}\Big[r_{t'}\Big] = \mathbb{E}_\tau\left[r_{t'} \cdot
\sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]</script>. The reason
why this is true can be somewhat tricky to identify. I find it easy to think
of just redefining <script type="math/tex">R(\tau)</script> as <script type="math/tex">r_{t'}</script> for some fixed timestep <script type="math/tex">t'</script>.
Then, we do the exact same computation above to get the final result, as shown
in the equation of the “Computing the Raw Gradient” section. The main
difference now is that since we’re considering the reward at time <script type="math/tex">t'</script>, our
trajectory under expectation <em>stops</em> at that time. More concretely,
<script type="math/tex">\nabla_\theta\mathbb{E}_{(s_0,a_0,\ldots,s_{T})}\Big[r_{t'}\Big] =
\nabla_\theta\mathbb{E}_{(s_0,a_0,\ldots,s_{t'})}\Big[r_{t'}\Big]</script>. This is
like “throwing away variables” when taking expectations due to “pushing
values” through sums and summing over densities (which cancel out); I have
another example later in this post which makes this explicit.</p>
<p>Next, we sum over both sides, for <script type="math/tex">t' = 0,1,\ldots,T-1</script>. Assuming we can
exchange the sum with the gradient, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t'=0}^{T-1} r_{t'}\right] \\
&= \sum_{t'=0}^{T-1}\nabla_\theta \mathbb{E}_{\tau^{(t')}} \Big[r_{t'}\Big] \\
&= \sum_{t'=0}^{T-1} \mathbb{E}_{\tau^{(t')}}\left[r_{t'} \cdot \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t'=0}^{T-1} r_{t'} \cdot \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right].
\end{align} %]]></script>
<p>where <script type="math/tex">\tau^{(t')}</script> indicates the trajectory up to time <script type="math/tex">t'</script>. (Full
disclaimer: I’m not sure if this formalism with <script type="math/tex">\tau</script> is needed, and I
think most people would do this computation without worrying about the precise
expectation details.)</p>
</li>
<li>
<p><strong>Step (iii)</strong> follows from a nifty algebra trick. To simplify the subsequent
notation, let <script type="math/tex">f_t := \nabla_\theta \log \pi_\theta(a_t|s_t)</script>. In addition,
<strong>ignore the expectation</strong>; we’ll only rearrange the inside here. With this
substitution and setup, the sum inside the expectation from <strong>Step (ii)</strong>
turns out to be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
r_0f_0 &+ \\
r_1f_0 &+ r_1f_1 + \\
r_2f_0 &+ r_2f_1 + r_2f_2 + \\
\cdots \\
r_{T-1}f_0 &+ r_{T-1}f_1 + r_{T-1}f_2 + \cdots + r_{T-1}f_{T-1}
\end{align} %]]></script>
<p>In other words, each <script type="math/tex">r_{t'}</script> has its own <em>row</em> of <script type="math/tex">f</script>-values to which it
gets distributed. Next, <em>switch to the column view</em>: instead of summing
row-wise, sum <em>column-wise</em>. The first column is <script type="math/tex">f_0 \cdot
\left(\sum_{t=0}^{T-1}r_t\right)</script>. The second is <script type="math/tex">f_1 \cdot
\left(\sum_{t=1}^{T-1}r_t\right)</script>. And so on. Doing this means we get the
desired formula after replacing <script type="math/tex">f_t</script> with its real meaning and hitting the
expression with an expectation.</p>
</li>
</ul>
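<p>The row/column rearrangement in Step (iii) is easy to sanity-check numerically. The sketch below (with arbitrary made-up values standing in for the rewards and the <script type="math/tex">f_t</script> terms) confirms that the row-wise sum from Step (ii) equals the column-wise, reward-to-go form from Step (iii):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
r = rng.standard_normal(T)   # per-step rewards r_0, ..., r_{T-1}
f = rng.standard_normal(T)   # stand-ins for the grad-log-prob terms f_t

# Row view (Step ii): each r_{t'} multiplies the sum f_0 + ... + f_{t'}.
row_view = sum(r[tp] * f[: tp + 1].sum() for tp in range(T))

# Column view (Step iii): each f_t multiplies the reward-to-go r_t + ... + r_{T-1}.
col_view = sum(f[t] * r[t:].sum() for t in range(T))

print(np.isclose(row_view, col_view))  # the two orderings agree
```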
<p>Note: it is <em>very easy</em> to make a typo with these. I checked my math carefully
and cross-referenced it with references online (which <em>themselves</em> have typos).
If any readers find a typo, please let me know.</p>
<p>Using the above formulation, we finally introduce our baseline <script type="math/tex">b</script>, which is a
function of <script type="math/tex">s_t</script> (and <em>not</em> <script type="math/tex">s_{t'}</script>, I believe). We “insert” it inside the
term in parentheses:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] =
\mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log
\pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'} - b(s_t)\right) \right]</script>
<p>At first glance, it doesn’t seem like this will be helpful, and one might wonder
if this would cause the gradient estimate to become biased. Fortunately, it
turns out that this is not a problem. This was surprising to me, because all we
know is that <script type="math/tex">b(s_t)</script> is a function of <script type="math/tex">s_t</script>. However, this is a bit
misleading because usually we want <script type="math/tex">b(s_t)</script> to be the <em>expected return</em>
starting at time <script type="math/tex">t</script>, which means it really “depends” on the subsequent time
steps. For now, though, just think of it as a function of <script type="math/tex">s_t</script>.</p>
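<p>In code, the term in parentheses is just the reward-to-go at each step minus the baseline evaluated at that step's state. Here is a minimal NumPy sketch, where both the rewards and the baseline values are made up purely for illustration:</p>

```python
import numpy as np

rewards = np.array([1.0, 0.5, 2.0, 0.0])    # r_0, ..., r_{T-1} (made up)
baseline = np.array([1.5, 1.2, 1.0, 0.3])   # b(s_0), ..., b(s_{T-1}) (made up)

# Reward-to-go: sum_{t'=t}^{T-1} r_{t'}, via a reversed cumulative sum.
rewards_to_go = np.cumsum(rewards[::-1])[::-1]

# Per-step weight on grad log pi(a_t|s_t) in the policy gradient estimate.
weights = rewards_to_go - baseline
print(weights)  # one scalar weight per time step
```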
<h1 id="understandingthebaseline">Understanding the Baseline</h1>
<p>In this section, I first go over why inserting <script type="math/tex">b</script> above doesn’t make
our gradient estimate biased. Next, I will go over why the baseline reduces
variance of the gradient estimate. These two capture the best of both worlds:
staying unbiased and reducing variance. In general, any time you have an
unbiased estimate and it remains so after applying a variance reduction
technique, then apply that variance reduction!</p>
<p>First, let’s show that the gradient estimate is unbiased. We see that with the
baseline, we can distribute and rearrange and get:</p>
<script type="math/tex; mode=display">\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] =
\mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log
\pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'}\right) - \sum_{t=0}^{T-1}
\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t) \right]</script>
<p>Due to linearity of expectation, all we need to show is that for any single time
<script type="math/tex">t</script>, the gradient of <script type="math/tex">\log \pi_\theta(a_t|s_t)</script> multiplied with <script type="math/tex">b(s_t)</script>
is zero. This is true because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}_{\tau \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)\Big] &= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ \mathbb{E}_{s_{t+1:T},a_{t:T-1}} [\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)]\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot \underbrace{\mathbb{E}_{s_{t+1:T},a_{t:T-1}} [\nabla_\theta \log \pi_\theta(a_t|s_t)]}_{E}\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot \mathbb{E}_{a_t} [\nabla_\theta \log \pi_\theta(a_t|s_t)]\Big] \\
&= \mathbb{E}_{s_{0:t},a_{0:t-1}}\Big[ b(s_t) \cdot 0 \Big] = 0
\end{align} %]]></script>
<p>Here are my usual overlydetailed comments (apologies in advance):</p>
<ul>
<li>
<p><strong>Note I</strong>: this notation is similar to what I had before. The trajectory
<script type="math/tex">s_0,a_0,\ldots,a_{T-1},s_{T}</script> is now represented as <script type="math/tex">s_{0:T},a_{0:T-1}</script>.
In addition, the expectation is split up, which is allowed. If this is
confusing, think of the definition of the expectation with respect to at least
two variables. We can write brackets in any appropriately enclosed location.
Furthermore, we can “omit” the unnecessary variables in going from
<script type="math/tex">\mathbb{E}_{s_{t+1:T},a_{t:T1}}</script> to <script type="math/tex">\mathbb{E}_{a_t}</script> (see expression
<script type="math/tex">E</script> above). Concretely, assuming we’re in discrete-land with actions in
<script type="math/tex">\mathcal{A}</script> and states in <script type="math/tex">\mathcal{S}</script>, this is because <script type="math/tex">E</script> evaluates
to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
E &= \sum_{a_t\in \mathcal{A}}\sum_{s_{t+1}\in \mathcal{S}}\cdots \sum_{s_T\in \mathcal{S}} \underbrace{\pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t) \cdots P(s_T|s_{T-1},a_{T-1})}_{p((a_t,s_{t+1},a_{t+1}, \ldots, a_{T-1},s_{T}))} (\nabla_\theta \log \pi_\theta(a_t|s_t)) \\
&= \sum_{a_t\in \mathcal{A}} \pi_\theta(a_t|s_t)\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{s_{t+1}\in \mathcal{S}} P(s_{t+1}|s_t,a_t) \sum_{a_{t+1}\in \mathcal{A}}\cdots \sum_{s_T\in \mathcal{S}} P(s_T|s_{T-1},a_{T-1})\\
&= \sum_{a_t\in \mathcal{A}} \pi_\theta(a_t|s_t)\nabla_\theta \log \pi_\theta(a_t|s_t)
\end{align} %]]></script>
<p>This is true because of the definition of expectation, whereby we get the
joint density over the entire trajectory, and then we can split it up like we
did earlier with the gradient of the log probability computation. We can
distribute <script type="math/tex">\nabla_\theta \log \pi_\theta(a_t|s_t)</script> all the way back to (but
not beyond) the first sum over <script type="math/tex">a_t</script>. Pushing sums “further back” results in
a bunch of sums over densities, each of which sums to one. The astute reader
will notice that this is precisely what happens with <a href="https://danieltakeshi.github.io/20150712notesonexactinferenceingraphicalmodels/">variable elimination for
graphical models</a>. (The more technical reason why “pushing values back
through sums” is allowed has to do with abstract algebra properties of the sum
function, which is beyond the scope of this post.)</p>
</li>
<li>
<p><strong>Note II</strong>: This proof above also works with an infinitetime horizon. In
Appendix B of the <em>Generalized Advantage Estimation</em> paper (<a href="https://arxiv.org/abs/1506.02438">arXiv link</a>),
the authors do so with a proof exactly matching the above, except that <script type="math/tex">T</script>
and <script type="math/tex">T-1</script> are now infinity.</p>
</li>
<li>
<p><strong>Note III</strong>: About the expectation going to zero, that’s due to a
well-known fact about <em>score</em> functions, which are precisely the gradients of
log probabilities. We went over this in <a href="https://danieltakeshi.github.io/2016/12/20/reviewoftheoreticalstatisticsstat210aatberkeley/">my STAT 210A class last fall</a>. It’s
<em>again</em> the log derivative trick. Observe that:</p>
<script type="math/tex; mode=display">\mathbb{E}_{a_t}\Big[\nabla_\theta \log \pi_\theta(a_t|s_t)\Big]
= \int \frac{\nabla_\theta
\pi_\theta(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\pi_{\theta}(a_t|s_t)da_t
= \nabla_\theta \int \pi_{\theta}(a_t|s_t)da_t = \nabla_\theta \cdot 1 = 0</script>
<p>where the penultimate step follows from how <script type="math/tex">\pi_\theta</script> is a density. This
follows for all time steps, and since the gradient of the log gets distributed
for each <script type="math/tex">t</script>, it applies in all time steps. I switched to the
continuous-land version for this, but it also applies with sums, as I just
recently used in Note I.</p>
</li>
</ul>
<p>The above shows that introducing <script type="math/tex">b</script> doesn’t cause bias.</p>
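<p>The zero-mean property of the score function is easy to verify numerically for a small discrete policy. The sketch below uses a softmax policy over three actions with arbitrary made-up logits; for a softmax, the gradient of the log probability with respect to the logits is the one-hot action vector minus the probability vector, so the expectation can be computed exactly:</p>

```python
import numpy as np

theta = np.array([0.3, -1.2, 0.8])          # made-up logits of a softmax policy
pi = np.exp(theta) / np.exp(theta).sum()    # pi_theta(a|s) for a in {0, 1, 2}

# For a softmax policy: grad_theta log pi(a|s) = one_hot(a) - pi.
grads = np.eye(len(theta)) - pi             # row a = grad log pi(a|s)

# E_{a ~ pi}[ grad_theta log pi(a|s) ] = sum_a pi(a|s) * grad log pi(a|s)
score_mean = pi @ grads
print(np.allclose(score_mean, 0.0))         # the score function has mean zero
```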
<p>The last thing to cover is why its introduction reduces variance. I provide an
approximate argument. To simplify notation, set <script type="math/tex">R_t(\tau) =
\sum_{t'=t}^{T-1}r_{t'}</script>. We focus on the <em>inside</em> of the expectation (of the
gradient estimate) to analyze the variance. The technical reason for this is
that expectations are technically <em>constant</em> (and thus have variance zero) but
in practice we have to approximate the expectations with trajectories, and that
has high variance.</p>
<p>The variance is approximated as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
{\rm Var}\left(\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) (R_t(\tau)-b(s_t))\right)\;&\overset{(i)}{\approx}\; \sum_{t=0}^{T-1} \mathbb{E}_\tau\left[\Big(\nabla_\theta \log \pi_\theta(a_t|s_t) (R_t(\tau)-b(s_t))\Big)^2\right] \\
\;&{\overset{(ii)}{\approx}}\; \sum_{t=0}^{T-1} \mathbb{E}_\tau \left[\Big(\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)^2\right]\mathbb{E}_\tau\left[\Big(R_t(\tau) - b(s_t)\Big)^2\right]
\end{align} %]]></script>
<p><strong>Approximation (i)</strong> is because we are approximating the variance of a sum by
computing the sum of the variances. This is not true in general, but if we can
assume this, then by the definition of the variance <script type="math/tex">{\rm Var}(X) :=
\mathbb{E}[X^2]-(\mathbb{E}[X])^2</script>, we are left with the <script type="math/tex">\mathbb{E}[X^2]</script>
term since we already showed that introducing the baseline doesn’t cause bias.
<strong>Approximation (ii)</strong> is because we assume independence among the values
involved in the expectation, and thus we can factor the expectation.</p>
<p>Finally, we are left with the term <script type="math/tex">\mathbb{E}_{\tau} \left[\Big(R_t(\tau) -
b(s_t)\Big)^2\right]</script>. If we are able to optimize our choice of <script type="math/tex">b(s_t)</script>, then
this is a least squares problem, and it is well known that the optimal choice of
<script type="math/tex">b(s_t)</script> is to be the expected value of <script type="math/tex">R_t(\tau)</script>. In fact, that’s <em>why</em>
policy gradient researchers usually want <script type="math/tex">b(s_t) \approx
\mathbb{E}[R_t(\tau)]</script> to approximate the expected return starting at time
<script type="math/tex">t</script>, and that’s <em>why</em> in the vanilla policy gradient algorithm we have to
refit the baseline estimate each time to make it as close to the expected
return <script type="math/tex">\mathbb{E}[R_t(\tau)]</script>. At last, I understand.</p>
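<p>This least-squares fact is also easy to check numerically: over samples of the return, the mean squared error against a constant baseline is minimized at the sample mean. A minimal sketch with synthetic return samples:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.standard_normal(10_000) + 3.0   # synthetic samples of R_t(tau)

def mse(b):
    """Empirical E[(R - b)^2] for a constant baseline b."""
    return np.mean((returns - b) ** 2)

b_opt = returns.mean()                        # least-squares choice: b = E[R]
# Perturbing b in either direction can only increase the mean squared error.
print(mse(b_opt) <= mse(b_opt + 0.5) and mse(b_opt) <= mse(b_opt - 0.5))
```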
<p>How accurate are these approximations in practice? My intuition is that they are
actually fine, because recent advances in reinforcement learning algorithms,
such as A3C, focus on the problem of breaking correlation among samples. If the
correlation among samples is broken, then Approximation (i) becomes better,
because I think the samples <script type="math/tex">s_0,a_0,\ldots,a_{T-1},s_{T}</script> are <em>no longer
generated from the same trajectory</em>.</p>
<p>Well, that’s my intuition. If anyone else has a better way of describing it,
feel free to let me know in the comments or by email.</p>
<h1 id="discountfactors">Discount Factors</h1>
<p>So far, we have assumed we wanted to optimize the expected return, or the
expected <em>sum of rewards</em>. However, if you’ve studied value iteration and policy
iteration, you’ll remember that we usually use <em>discount factors</em> <script type="math/tex">\gamma \in
(0,1]</script>. These empirically work well because the effect of an action many time
steps later is likely to be negligible compared to that of other actions. Thus, it may
not make sense to try and include raw distant rewards in our optimization
problem, so we often impose a discount as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'} - b(s_t)\right) \right] \\
&\approx \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}\gamma^{t'-t}r_{t'} - b(s_t)\right) \right]
\end{align} %]]></script>
<p>where the <script type="math/tex">\gamma^{t'-t}</script> serves as the discount, starting from 1, then
getting smaller as time passes. (The first line above is a repeat of the policy
gradient formula that I describe earlier.) As this is not exactly the “desired”
gradient, this is an <em>approximation</em>, but it’s a reasonable one. This time, we
now want our baseline to satisfy <script type="math/tex">b(s_t) \approx \mathbb{E}[r_t + \gamma
r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1}]</script>.</p>
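<p>The discounted reward-to-go is typically computed with a backwards recursion, <script type="math/tex">G_t = r_t + \gamma G_{t+1}</script>, rather than by summing the discount series directly at each step. A minimal sketch with made-up rewards:</p>

```python
gamma = 0.99
rewards = [1.0, 0.0, 2.0, 3.0]   # made-up r_t, ..., r_{T-1}

# Backwards recursion: G_t = r_t + gamma * G_{t+1}, with G_T = 0.
G = 0.0
rtg = []
for r in reversed(rewards):
    G = r + gamma * G
    rtg.append(G)
rtg = rtg[::-1]                  # rtg[t] = sum_{t'=t} gamma^(t'-t) * r_{t'}
print(rtg)
```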
<h1 id="advantagefunctions">Advantage Functions</h1>
<p>In this final section, we rewrite the policy gradient formula in terms of the following
<em>value</em> functions:</p>
<script type="math/tex; mode=display">Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} r_t \;\Bigg|\; s_0=s,a_0=a\right]</script>
<script type="math/tex; mode=display">V^\pi(s) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} r_t \;\Bigg|\; s_0=s\right]</script>
<p>Both of these should be familiar from basic AI; see the CS 188 notes from
Berkeley if this is unclear. There are also <em>discounted</em> versions, which we can
denote as <script type="math/tex">Q^{\pi,\gamma}(s,a)</script> and <script type="math/tex">V^{\pi,\gamma}(s)</script>. In addition, we can
also consider starting at any given time step, as in <script type="math/tex">Q^{\pi,\gamma}(s_t,a_t)</script>
which provides the expected (discounted) return assuming that at time <script type="math/tex">t</script>, our
stateaction pair is <script type="math/tex">(s_t,a_t)</script>.</p>
<p>What might be new is the <em>advantage</em> function. For the undiscounted version, it
is defined simply as:</p>
<script type="math/tex; mode=display">A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)</script>
<p>with a similar definition for the discounted version. Intuitively, the advantage
tells us how much better action <script type="math/tex">a</script> would be compared to the return based on
an “average” action.</p>
<p>The above definitions look very close to what we have in our policy gradient
formula. In fact, we can claim the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T-1}r_{t'} - b(s_t)\right) \right] \\
&{\overset{(i)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \Big(Q^{\pi}(s_t,a_t)-V^\pi(s_t)\Big) \right] \\
&{\overset{(ii)}{=}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi}(s_t,a_t) \right] \\
&{\overset{(iii)}{\approx}}\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi,\gamma}(s_t,a_t) \right]
\end{align} %]]></script>
<p>In (i), we replace terms with their expectations. This is <em>not</em> generally valid
to do, but it should work in this case. My guess is that if you start from the
second line above (after the “(i)”) and plug in the definition of the
expectation inside and rearrange terms, you can get the first line. However, I
have not had the time to check this in detail and it takes a lot of space to
write out the expectation fully. The conditioning with the value functions makes
it a bit messy and thus the law of iterated expectation may be needed.</p>
<p>Also from line (i), we notice that <em>the value function is a baseline</em>, and hence
we can add it there without changing the unbiasedness of the expectation. Then
lines (ii) and (iii) are just for the advantage function. The implication of
this formula is that the problem of policy gradients, in some sense, <em>reduces to
finding good estimates <script type="math/tex">\hat{A}^{\pi,\gamma}(s_t,a_t)</script> of the advantage
function</em> <script type="math/tex">A^{\pi,\gamma}(s_t,a_t)</script>. That is precisely the topic of the paper
<em><a href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation</a></em>.</p>
<h1 id="concludingremarks">Concluding Remarks</h1>
<p>Hopefully, this is a helpful, self-contained, bare-minimum introduction to
policy gradients. I am trying to learn more about these algorithms, and going
through the math details is helpful. This will also make it easier for me to
understand the increasing number of research papers that are using this
notation.</p>
<p>I also have to mention: I remember a few years ago during the <a href="https://danieltakeshi.github.io/20151217reviewofdeepreinforcementlearningcs294112atberkeley/">first iteration
of CS 294-112</a> that I had no idea how policy gradients worked. Now, I think I
have become slightly more enlightened.</p>
<p><strong>Acknowledgements</strong>: I thank John Schulman for making his notes publicly
available.</p>
<p><strong>Update April 19, 2017</strong>: I have code for vanilla policy gradients in my
<a href="https://github.com/DanielTakeshi/rl_algorithms">reinforcement learning GitHub repository</a>.</p>
Mon, 27 Mar 2017 23:00:00 -0700
https://danieltakeshi.github.io/2017/03/28/goingdeeperintoreinforcementlearningfundamentalsofpolicygradients/
https://danieltakeshi.github.io/2017/03/28/goingdeeperintoreinforcementlearningfundamentalsofpolicygradients/Keeping Track of Research Articles: My Paper Notes Repository<p>The number of research papers in Artificial Intelligence has reached
unmanageable proportions. Conferences such as ICML, NIPS, and ICLR, among others,
are getting record numbers of paper submissions. In addition, tens of AI-related
papers get uploaded to arXiv <em>every weekday</em>. With all these papers, it can be
easy to feel lost and overwhelmed.</p>
<p>Like many researchers, I think I do not read enough research papers. This year,
I resolved to change that, so I started an <a href="https://github.com/DanielTakeshi/Paper_Notes">open-source GitHub repository called
“Paper Notes”</a> where I list papers that I’ve read along with my personal
notes and summaries, if any. Papers without such notes are currently on my TODO
radar.</p>
<p>After almost three months, I’m somewhat pleased with my reading progress. There
are a healthy number of papers (plus notes) listed, arranged by subject matter
and then further arranged by year. Not enough for me, but certainly not terrible
either.</p>
<p>I was inspired to make this by seeing <a href="https://github.com/dennybritz/deeplearningpapernotes">Denny Britz’s similar repository</a>,
along with <a href="https://blog.acolyer.org/about/">Adrian Colyer’s blog</a>. My repository is similar to Britz’s,
though my aim is not to list all papers in Deep Learning, but to write down the
ones that I actually plan to read at some point. (I see other repositories where
people simply list Deep Learning papers without notes, which seems pretty
pointless to me.) Colyer’s blog posts represent the kind of notes that I’d like
to take for each paper, but I know that I can’t dedicate <em>that</em> much time to
fine-tuning notes.</p>
<p>Why did I choose GitHub as the backend for my paper management, rather than
something like <a href="https://www.mendeley.com/">Mendeley</a>? First, GitHub is the default place where (pretty
much) everyone in AI puts their opensource stuff: blogs, code, you name it. I’m
already used to GitHub, so Mendeley would have to provide some serious benefit
for me to switch over. I also don’t need to use advanced annotation and
organizing materials, given that the top papers are easily searchable online
(including their BibTeX references). In addition, by making my Paper Notes
repository online, I can show this as evidence to others that I’m reading
papers. Maybe this will even impress a few folks, and I say this only because
everyone wants to be noticed in some way; that’s partly Colyer’s inspiration for
his blog. So I think, on balance, it will be useful for me to keep updating
this repository.</p>
Thu, 23 Mar 2017 15:30:00 -0700
https://danieltakeshi.github.io/2017/03/23/keepingtrackofresearcharticlesmypapernotesrepository/
https://danieltakeshi.github.io/2017/03/23/keepingtrackofresearcharticlesmypapernotesrepository/