Seita's Place

This is my blog, where I have written over 250 articles on a variety of topics, most of which are about one of two major themes. The first is computer science, which is my area of specialty as a Ph.D. student at UC Berkeley. The second can be broadly categorized as "deafness," which relates to my experience and knowledge of being deaf.

https://danieltakeshi.github.io/
Sat, 15 Jul 2017 21:33:47 -0700
Jekyll v3.4.5

How I Organize My GitHub Repositories

<p>I’ve been putting more of my work-related stuff <a href="https://github.com/DanielTakeshi">in GitHub repositories</a> and
by now I have more or less settled on a reasonable workflow for utilizing
GitHub. For those of you who are new to this, GitHub helps us easily visualize
and share <em>code repositories</em> online, whether in public (visible to everyone) or
private (visible only to those with permissions), though technically
repositories don’t have to be strictly code-based. GitHub builds on the version
control system <em>git</em>, which is what actually handles the technical
machinery. It’s grown into the de facto place where computer scientists —
particularly those in Artificial Intelligence — present their work. What
follows is a brief description of what I use GitHub for; in particular, I have
many <em>public</em> repositories along with a few <em>private</em> repositories.</p>
<p>For <em>public</em> repositories, I have the following:</p>
<ul>
<li>A <strong>Paper Notes</strong> repository, where I <a href="https://github.com/DanielTakeshi/Paper_Notes">write notes for research papers</a>. A
few months ago, <a href="https://danieltakeshi.github.io/2017/03/23/keeping-track-of-research-articles-my-paper-notes-repository/">I wrote a brief blog post</a> describing why I decided to do
this. Fortunately, I have come back to this repository several times to see
what I wrote for certain research papers. The more I’m doing this, the more
useful it is! The same holds for running a blog; the more I find myself
re-reading it, the better!</li>
<li>A repository for <strong>coding various algorithms</strong>. I actually have two
repositories which carry out this goal: one for <a href="https://github.com/DanielTakeshi/rl_algorithms">reinforcement learning</a>
and another for <a href="https://github.com/DanielTakeshi/MCMC_and_Dynamics">MCMC-related topics</a>. The goal of these is to help me
understand existing algorithms; many state-of-the-art algorithms are
tricky to implement precisely because they are state-of-the-art.</li>
<li>A repository for <strong>miscellaneous personal projects</strong>, such as one for <a href="https://github.com/DanielTakeshi/Project_Euler_in_C">Project
Euler problems</a> (yes, I’m still doing that … um, barely!) and another for
<a href="https://github.com/DanielTakeshi/Self_Study_Courses">self-studying various</a> courses and textbooks.</li>
<li>A repository for <strong>preparing for coding interviews</strong>. I thought it might be
useful to post <a href="https://github.com/DanielTakeshi/Interview_Practice">some of my solutions to practice problems</a>.</li>
<li>A repository for my <strong>vimrc</strong> file. Right now <a href="https://github.com/DanielTakeshi/vimrc">my vimrc file is only a few
lines</a>, but it might get more complex. I’m using a number of computers
nowadays (mostly via ssh), so one of the first steps to get started with a
machine is to clone the repository and establish my vimrc.</li>
<li>Lastly, but certainly not least, don’t forget that <a href="https://github.com/DanielTakeshi/DanielTakeshi.github.io">there’s a repository</a>
for <strong>my blog</strong>. That’s obviously the most important one!</li>
</ul>
<p>On the other hand, there are many cases when it makes sense for individuals to
use private repositories. (I’m using “individuals” here since it should be clear
that all companies have their “critical” code in private version control.) Here
are some of the private repositories I have:</p>
<ul>
<li><strong>All ongoing research projects</strong> have their own private repository. This
should be a no-brainer. You don’t want to get scooped, particularly in a
fast-paced field such as Artificial Intelligence. Once such papers are ready
to be posted to arXiv, that’s when the repository can be released to the
public, or copied to a new public one to start fresh.</li>
<li>I also have one repository that I’ll call a <strong>research sandbox</strong>. It contains
multiple random ideas I have, and I run smaller-scale experiments here to test
ideas. If any ideas look like they’ll work, I start a new repository to
develop them further. On a side note, running quick experiments to test an
idea before scaling it up is a skill that I need to work on!</li>
<li>Finally, I have a repository for <strong>homework</strong>, which also includes class final
projects. It’s particularly useful when one’s laptop is relatively old (like
mine), since the computer might die and all my work LaTeX-ing statistics
homework would be lost. At this point, though, I think
I’m done taking any real classes so I don’t know if I’ll be using this one
anymore.</li>
</ul>
<p>Well, this is a picture of how I manage my repositories. I am pleased with this
configuration, and perhaps others who are starting out with GitHub might adapt
some of these repositories for themselves.</p>
Sat, 15 Jul 2017 14:00:00 -0700
https://danieltakeshi.github.io/2017/07/15/how-i-organize-my-github-repositories/
https://danieltakeshi.github.io/2017/07/15/how-i-organize-my-github-repositories/

Saving Neural Network Model Weights Using a Hierarchical Organization

<p>Over the last two weeks, I have been using more Theano-based code for Deep
Learning instead of TensorFlow, in part due to diving into OpenAI’s <a href="https://danieltakeshi.github.io/2017/06/15/openais-generative-adversarial-imitation-learning-code/">Generative
Adversarial Imitation Learning code</a>.</p>
<p>That code base has also taught me something that I have wondered about on
occasion: what is the “proper” way to save and load neural network model
weights? At the very least, how should we as programmers save weights in a way
that’s robust, scalable, and easy to understand? In my view, there are two major
steps to this procedure:</p>
<ol>
<li>Extracting the model weights into, or setting them from, a single vector of parameters.</li>
<li>Actually storing that vector of weights in a file.</li>
</ol>
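<p>To make these two steps concrete, here is a minimal, framework-agnostic sketch
of the first step in plain NumPy. The function names are my own, not from any
library; the point is just the flatten/unflatten round trip.</p>

```python
import numpy as np

def flatten_params(params):
    """Concatenate a list of weight arrays into one flat vector."""
    return np.concatenate([p.ravel() for p in params])

def unflatten_params(theta, shapes):
    """Split a flat vector back into arrays with the given shapes.

    The ordering of `shapes` must match the ordering used when flattening.
    """
    params, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(theta[start:start + size].reshape(shape))
        start += size
    return params

# Round trip on two toy "layers."
weights = [np.arange(6.0).reshape(2, 3), np.ones(3)]
theta = flatten_params(weights)
restored = unflatten_params(theta, [w.shape for w in weights])
```

<p>The TensorFlow snippet later in this post uses the same
<code>start:start+size</code> slicing; the only difference is that there the
assignments happen inside the computational graph.</p>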
<p>One way to do the first step is to save model weights in a vector, and use that
vector to load the weights back to the model as needed. I do this in <a href="https://github.com/DanielTakeshi/rl_algorithms">my
personal reinforcement learning repository</a>, for instance. It’s implemented
in TensorFlow, but the main ideas still hold across Deep Learning software.
Here’s a conceptually self-contained code snippet for <em>setting</em> model weights
from a vector <code class="highlighter-rouge">self.theta</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="bp">self</span><span class="o">.</span><span class="n">theta</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">num_params</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"theta"</span><span class="p">)</span>
<span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">updates</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">:</span>
<span class="n">shape</span> <span class="o">=</span> <span class="n">v</span><span class="o">.</span><span class="n">get_shape</span><span class="p">()</span>
<span class="n">size</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_prod</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span>
<span class="c"># Note that tf.assign(ref, value) assigns `value` to `ref`.</span>
<span class="n">updates</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="n">tf</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">start</span><span class="o">+</span><span class="n">size</span><span class="p">],</span> <span class="n">shape</span><span class="p">))</span>
<span class="p">)</span>
<span class="n">start</span> <span class="o">+=</span> <span class="n">size</span>
<span class="bp">self</span><span class="o">.</span><span class="n">set_params_flat_op</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="o">*</span><span class="n">updates</span><span class="p">)</span> <span class="c"># Performs all updates together.</span></code></pre></figure>
<p>In later code, I run TensorFlow sessions on <code class="highlighter-rouge">self.set_params_flat_op</code> and supply
<code class="highlighter-rouge">self.theta</code> with the weight vector in the <code class="highlighter-rouge">feed_dict</code>. The op then iteratively
extracts each segment of the <code class="highlighter-rouge">self.theta</code> vector and assigns it
to the corresponding weight. The main thing to watch out for here is that
<code class="highlighter-rouge">self.theta</code> must actually contain the weights in the correct ordering.</p>
<p>I’m more curious about the second stage of this process, that of saving and
loading weights into files. I used to use pickle files to save the weight
vectors, but one problem is the <a href="https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3">incompatibility between Python 2 and Python 3
pickle files</a>. Given that I sometimes switch back and forth between
versions, and that I’d like to keep the files consistent across versions, this
is a huge bummer for me. Another downside is the lack of <em>organization</em>. Again,
I still have to be careful to ensure that the weights are stored in the correct
ordering so that I can use <code class="highlighter-rouge">self.theta[start:start+size]</code>.</p>
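<p>One portable alternative I could try is NumPy’s <code>npz</code> format, which
stores each array under its own name. This is only a sketch; the layer names
below are hypothetical.</p>

```python
import numpy as np

# Hypothetical layer weights. Storing arrays by name removes the need to
# track a single flat vector's ordering by hand.
weights = {"layer0_W": np.random.randn(4, 32), "layer0_b": np.zeros(32)}
np.savez("weights.npz", **weights)

loaded = np.load("weights.npz")
restored = {name: loaded[name] for name in loaded.files}
```

<p>For plain numeric arrays this avoids pickle entirely, so it loads the same
way under Python 2 and 3; object arrays, however, still fall back to pickle
under the hood.</p>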
<p>After looking at how the GAIL code stores and loads model weights, I realized
it’s different from saving a single pickle file or numpy array. I started by running
their Trust Region Policy Optimization code (<code class="highlighter-rouge">scripts/run_rl_mj.py</code>) and
observed that the code specifies neural network weights with a list of
dictionaries. Nice! I was wondering about how I could better generalize my
existing neural network code.</p>
<p>Moving on, what happens after saving the snapshots? (In Deep Learning it’s
common to refer to weights after specific iterations as “snapshots” to be
saved.) The GAIL code uses a <code class="highlighter-rouge">TrainingLog</code> class which utilizes <a href="http://www.pytables.org/">PyTables</a>
and — by extension — the HDF5 file format. If I run the TRPO code I might
get <code class="highlighter-rouge">trpo_logs/CartPole-v0.h5</code> as the output file. It doesn’t have to end with
the HDF5 extension <code class="highlighter-rouge">.h5</code>, but that’s the convention. Policies in the code are
subclasses of a generic <code class="highlighter-rouge">Policy</code> class to handle the case of discrete versus
continuous control. The <code class="highlighter-rouge">Policy</code> class is a subclass of an abstract <code class="highlighter-rouge">Model</code>
class which provides an interface for saving and loading weights.</p>
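<p>As a sketch of how such a hierarchy might be written with <code>h5py</code> (the
group and dataset names here are made up for illustration, not the GAIL code’s
actual layout):</p>

```python
import h5py
import numpy as np

# h5py creates the intermediate groups automatically when a dataset is
# assigned to a slash-separated path.
with h5py.File("snapshots_demo.h5", "w") as f:
    f["snapshots/iter0000100/policy/layer_0/W"] = np.random.randn(4, 32)
    f["snapshots/iter0000100/policy/layer_0/b"] = np.zeros(32)

with h5py.File("snapshots_demo.h5", "r") as f:
    W = f["snapshots/iter0000100/policy/layer_0/W"][()]
```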
<p>I decided to explore a bit more, this time using the pre-trained CartPole-v0
policy provided by GAIL:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">import</span> <span class="nn">h5py</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="k">with</span> <span class="n">h5py</span><span class="o">.</span><span class="n">File</span><span class="p">(</span><span class="s">"expert_policies/classic/CartPole-v0.h5"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span>
<span class="o">...</span><span class="p">:</span>
<span class="p">[</span><span class="s">u'log'</span><span class="p">,</span> <span class="s">u'snapshots'</span><span class="p">]</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="k">with</span> <span class="n">h5py</span><span class="o">.</span><span class="n">File</span><span class="p">(</span><span class="s">"expert_policies/classic/CartPole-v0.h5"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">[</span><span class="s">'log'</span><span class="p">])</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">[</span><span class="s">'snapshots'</span><span class="p">])</span>
<span class="o">...</span><span class="p">:</span>
<span class="o"><</span><span class="n">HDF5</span> <span class="n">dataset</span> <span class="s">"log"</span><span class="p">:</span> <span class="n">shape</span> <span class="p">(</span><span class="mi">101</span><span class="p">,),</span> <span class="nb">type</span> <span class="s">"V80"</span><span class="o">></span>
<span class="o"><</span><span class="n">HDF5</span> <span class="n">group</span> <span class="s">"/snapshots"</span> <span class="p">(</span><span class="mi">6</span> <span class="n">members</span><span class="p">)</span><span class="o">></span>
<span class="n">In</span> <span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="k">with</span> <span class="n">h5py</span><span class="o">.</span><span class="n">File</span><span class="p">(</span><span class="s">"expert_policies/classic/CartPole-v0.h5"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">[</span><span class="s">'snapshots/iter0000100/GibbsPolicy/hidden/FeedforwardNet/layer_0/AffineLayer/W'</span><span class="p">]</span><span class="o">.</span><span class="n">value</span><span class="p">)</span>
<span class="o">...</span><span class="p">:</span>
<span class="c"># value gets printed here ...</span></code></pre></figure>
<p>It took me a while to figure this out, but here’s how to walk through the nodes
in the entire file:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">In</span> <span class="p">[</span><span class="mi">5</span><span class="p">]:</span> <span class="k">def</span> <span class="nf">print_attrs</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">obj</span><span class="p">):</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="o">...</span><span class="p">:</span> <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="n">obj</span><span class="o">.</span><span class="n">attrs</span><span class="o">.</span><span class="n">iteritems</span><span class="p">():</span>
<span class="o">...</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">" {}: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">))</span>
<span class="o">...</span><span class="p">:</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">6</span><span class="p">]:</span> <span class="n">expert_policy</span> <span class="o">=</span> <span class="n">h5py</span><span class="o">.</span><span class="n">File</span><span class="p">(</span><span class="s">"expert_policies/classic/CartPole-v0.h5"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">7</span><span class="p">]:</span> <span class="n">expert_policy</span><span class="o">.</span><span class="n">visititems</span><span class="p">(</span><span class="n">print_attrs</span><span class="p">)</span>
<span class="c"># Lots of stuff printed here!</span></code></pre></figure>
<p>PyTables works well for <em>hierarchical data</em>, which is nice for Deep
Reinforcement Learning because there are many ways to form a hierarchy:
snapshots, iterations, layers, weights, and so on. All in all, PyTables looks
like a tremendously useful library. I should definitely consider using it to
store weights. Furthermore, even if it would be easier to store a single
weight vector as I do now (see my TensorFlow code snippet from earlier), the
generality of PyTables means it might have crossover effects to other code I
want to run in the future. Who knows?</p>
Thu, 06 Jul 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/07/06/saving-neural-network-model-weights-using-a-hierarchical-organization/
https://danieltakeshi.github.io/2017/07/06/saving-neural-network-model-weights-using-a-hierarchical-organization/

Review of Theoretical Statistics (STAT 210B) at Berkeley

<p>After taking STAT 210A last semester (and <a href="https://danieltakeshi.github.io/2016/12/20/review-of-theoretical-statistics-stat-210a-at-berkeley/">writing way too much about it</a>),
it made sense for me to take STAT 210B, the continuation of Berkeley’s
theoretical statistics course aimed at PhD students in statistics and related
fields.</p>
<h2 id="thebeginning">The Beginning</h2>
<p>Our professor was <a href="https://people.eecs.berkeley.edu/~jordan/">Michael I. Jordan</a>, who is colloquially called the
“Michael Jordan of machine learning.” Indeed, how does one begin to describe
his research? Yann LeCun, himself an extraordinarily prominent Deep Learning
researcher and considered one of the three leaders in the
field<sup id="fnref:schmidhuber"><a href="#fn:schmidhuber" class="footnote">1</a></sup>, said this<sup id="fnref:interview"><a href="#fn:interview" class="footnote">2</a></sup> in a <a href="https://www.facebook.com/yann.lecun/posts/10152348155137143">public Facebook post</a>:</p>
<blockquote>
<p>Mike’s research direction tends to take radical turns every 5 years or so,
from cognitive psychology, to neural nets, to motor control, to probabilistic
approaches, graphical models, variational methods, Bayesian nonparametrics,
etc. Mike is the “Miles Davis of Machine Learning”, who reinvents himself
periodically and sometimes leaves fans scratching their heads after he changes
direction.</p>
</blockquote>
<p>And Professor Jordan responded with:</p>
<blockquote>
<p>I am particularly fond of your “the Miles Davis of machine learning” phrase.
(While “he’s the Michael Jordan of machine learning” is amusing—or so I’m
told—your version actually gets at something real).</p>
</blockquote>
<p>As one would expect, he’s extremely busy, and I think he had to miss four
lectures for 210B. Part of the reason might be that, as he mentioned to us:
“I wasn’t planning on teaching this course … but as chair of the statistics
department, I assigned it to myself. I thought it would be fun to teach.” The TAs
were able to substitute, though it seemed like some of the students in the class
decided to skip those lectures.</p>
<p>Just because his teaching of 210B was somewhat “unplanned” doesn’t mean that it was
easy — far from it! In the first minute of the first lecture, he said that
210B is the hardest course that the statistics department offers. Fortunately,
he followed up with saying that the grading would be lenient, that he didn’t
want to scare us, and so forth. Whew. We also had two TAs (or “GSIs” in Berkeley
language) who we could ask for homework assistance.</p>
<p>Then we dived into the material. One of the first things we talked about was
<em>U-Statistics</em>, a concept that can often trip me up because of my lack of
intuition in internalizing expectations of expectations and how to rearrange
related terms in clever ways. Fortunately, we had a homework assignment question
about U-Statistics in 210A so I was able to follow some of the material. We also
talked about the related <em>Hájek projection</em>.</p>
<h2 id="divingintohighdimensionalstatistics">Diving into High-Dimensional Statistics</h2>
<p>We soon delved into the meat of the course. I consider this to be the
material in our textbook for the course, Professor Martin Wainwright’s recent
book <em>High-Dimensional Statistics: A Non-Asymptotic Viewpoint</em>.</p>
<p>For those of you who don’t know, Professor Wainwright is a faculty member in the
Berkeley statistics and EECS departments who won the 2014 COPSS “Nobel Prize in
Statistics” award due to his work on high-dimensional statistics. <a href="https://simplystatistics.org/2014/08/18/interview-with-copss-award-winner-martin-wainright/">Here’s the
transcript of his interview</a>, where he says that serious machine learning
students <em>must</em> know statistics. As a caveat, the students he’s referring to are
the kind that populate the PhD programs in schools like Berkeley, so he’s
talking about the best of the best. It’s true that <em>basic</em> undergraduate
statistics courses are useful for a broad range of students — and I wish I had
taken more when I was in college — but courses like 210B are needed by only
a handful of students in specialized domains.</p>
<p>First, what is “high-dimensional” statistics? Suppose we have parameter <script type="math/tex">\theta
\in \mathbb{R}^d</script> and <script type="math/tex">n</script> labeled data points <script type="math/tex">\{(x_i,y_i)\}_{i=1}^n</script> which
we can use to estimate <script type="math/tex">\theta</script> via linear regression or some other procedure.
In the classical setting, we can safely assume that <script type="math/tex">n > d</script>, or that <script type="math/tex">n</script> is
allowed to increase while the data dimension <script type="math/tex">d</script> is typically held fixed. This
is not the case in high-dimensional (or “modern”) statistics, where the
relationship is reversed, with <script type="math/tex">d > n</script>. Classical algorithms end up running
into brick walls in these cases, so new theory is needed, which is precisely
the main contribution of Wainwright’s research. It’s also the main focus of STAT
210B.</p>
<p>The most important material to know from Wainwright’s book is the stuff from the
second chapter: sub-Gaussian random variables, sub-Exponential random variables,
bounds from Lipschitz functions, and so on. We referenced back to this material
<em>all</em> the time.</p>
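<p>For reference, since this definition gets used constantly: a random variable
<script type="math/tex">X</script> with mean <script type="math/tex">\mu</script> is sub-Gaussian with parameter <script type="math/tex">\sigma</script> if its
moment generating function satisfies the first bound below, which via the
Chernoff argument yields the familiar two-sided tail bound. (This is my
paraphrase of the standard Chapter 2 material, not a quote from the book.)</p>

```latex
\mathbb{E}\big[e^{\lambda (X - \mu)}\big] \le e^{\lambda^2 \sigma^2 / 2}
  \quad \text{for all } \lambda \in \mathbb{R},
\qquad \Longrightarrow \qquad
\mathbb{P}\big[\,|X - \mu| \ge t\,\big] \le 2 e^{-t^2 / (2\sigma^2)}
  \quad \text{for all } t \ge 0.
```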
<p>We then moved away from Wainwright’s book to talk about entropy, the Efron-Stein
Inequality, and related topics. Professor Jordan criticized Professor Wainwright
for not including this material in his book. I somewhat agree with him, but for
a different reason: I found this material harder to follow compared to other
class concepts, so it would have been nice to see Professor Wainwright’s
interpretation of it.</p>
<p>Note to future students: get the book by Boucheron, Lugosi, and Massart, titled
<em>Concentration Inequalities: A Nonasymptotic Theory of Independence</em>. I think
that’s the book Professor Jordan was reviewing when he gave these
non-Wainwright-related lectures, because he was using the same exact notation as
in the book.</p>
<p>How did I know about the book, which, amazingly, <em>wasn’t even listed on the
course website</em>? Another student brought it to the class and I peeked over the
student’s shoulder to see the title. Heh. I memorized the title and promptly
ordered it online. Unfortunately, or perhaps fortunately, Professor Jordan then
moved on to exclusively material from Professor Wainwright’s book.</p>
<p>If any future students want to buy the Boucheron et al. book off me, send me
an email.</p>
<p>After a few lectures, it was a relief to me when we returned to material from
Wainwright’s book, which included:</p>
<ul>
<li>Rademacher and Gaussian Complexity (these concepts were briefly discussed in a
<a href="https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes">Deep Learning paper I recently blogged about</a>)</li>
<li>Metric entropy, coverings, and packings</li>
<li>Random matrices and high-dimensional covariance matrix estimation</li>
<li>High-dimensional, sparse linear models</li>
<li>Nonparametric least squares</li>
<li>Minimax lower bounds, a “Berkeley specialty” according to Professor Jordan</li>
</ul>
<p>I obtained a decent understanding of how these concepts relate to each other.
The concepts appear in many chapters beyond the ones where they’re formally
defined, because they can be useful as “subroutines” or as part of technical
lemmas for other problems.</p>
<p>Despite my occasional complaint about not understanding details in Wainwright’s
book — which I’ll bring up later in this blog post — I think
the book is above-average in terms of clarity, relative to other textbooks aimed
at graduate students. There were often enough high-level discussions so that I
could see the big picture. One thing that needs to be fixed, though, is the
typos. Professor Jordan frequently pointed these out during lecture, and would
also sometimes ask us to confirm his suspicions that something was a typo.</p>
<p>Regarding homework assignments, we had seven of them, each of which had about
five problems with multiple parts per problem. I was usually able to
correctly complete about half of each homework by myself. For the other half, I
needed to consult the GSIs, other students, or perform extensive online research
to assist me with the last parts. Some of the homework problems were clearly
inspired by Professor Wainwright’s research papers, but I didn’t have much
success translating from research paper to homework solution.</p>
<p>For me, some of the most challenging homework problems pertained to material
that wasn’t in Wainwright’s textbook. In part this is because some of the
problems from Wainwright’s book have a similar flavor to exercises in the main
text of the book, which were often accompanied by solutions.</p>
<h2 id="thefinalexam">The Final Exam</h2>
<p>In one of the final lectures of the class, Professor Jordan talked about the
final exam — that it would cover a range of questions, that it would be
difficult, and so forth — but then he also mentioned that he could complete it
<em>in an hour</em>. (Final exams in Berkeley are in three-hour slots.) While he
quickly added “I don’t mean to disparage you…”, unfortunately I found
the original comment about completing the exam in an hour quite disparaging. I’m
baffled by why professors say that; it seems to be a no-win situation for the
students. Furthermore, no student is going to question a Berkeley professor’s
intelligence; I certainly wouldn’t.</p>
<p>That comment aside, the final exam was scheduled for Thursday at 8:00AM (!!).
I was hoping we could keep this time slot, since I am a morning
person and if other students aren’t, then I have a competitive advantage.
Unfortunately, Professor Jordan agreed with the majority of the class, who
hated that time, so we had a poll and switched to Tuesday at 3:00PM. Darn. At
least we know now that professors are often more lenient towards graduate
students than undergrads.</p>
<p>On the day of the final exam, I felt something really wrenching. And it wasn’t
something that had to do with the actual exam, though that of course was also
“wrenching.” It was this:</p>
<blockquote>
<p>It looked like my streak of having all professors know me on a first-name
basis was about to be snapped.</p>
</blockquote>
<p>For the last <em>seven years</em> at Williams and Berkeley, I’m pretty sure I managed
to be known on a first-name basis by the professors from <em>all</em> of my courses.
Yes, all of them. It’s easier to get to know professors at Williams, since the
school is small and professors often make it a point to know the names of every
student. At Berkeley it’s obviously different, but graduate-level courses tend
to be better about one-on-one interaction between students and professors. In addition,
I’m the kind of student who frequently attends office hours. On top of it all,
due to my deafness, I get some form of visible accommodation, either captioning
(CART providers) or sign language interpreting services.</p>
<p>Yes, I have <em>a little bit</em> of an unfair advantage in getting <em>noticed</em> by
professors<sup id="fnref:sadly"><a href="#fn:sadly" class="footnote">3</a></sup>, but I was worried that my streak was about to be snapped.
It wasn’t for lack of trying; I had indeed attended office hours once with
Professor Jordan (who promptly criticized me for my lack of measure theory
knowledge) and yes, he was obviously aware of the sign language interpreters I
had, but as far as I can tell he didn’t really <em>know</em> me.</p>
<p>So here’s what happened just before we took the final. Since the exam was at a
different time slot than the “official” one, Professor Jordan decided to take
attendance.</p>
<p>My brain orchestrated an impressive mental groan. It’s a pain for me to figure
out when I should raise my hand. I did not have a sign language interpreter
present, because why would I? It’s a three-hour exam and there wouldn’t be (well, there
better not be!) any real discussion. I also have bad memories because one time
during a high school track practice, I gambled and raised my hand when the team
captains were taking attendance … only to figure out that the person being
called at that time had “Rizzuto” as his last name. Oops.</p>
<p>Then I thought of something. <em>Wait</em> … why should I even raise my hand? If
Professor Jordan knew me, then surely he would indicate to me in some way (e.g.
by staring at me). Furthermore, if my presence was that important to the extent
that my absence would cause a police search for me, then another student or TA
should certainly point me out.</p>
<p>So … Professor Jordan took attendance. I kept turning around to see the
students who raised their hand (I sat in the front of the class. Big surprise!).
I grew anxious when I saw the raised hand of a student whose last name started
with “R”. It was the moment of truth …</p>
<p>A few seconds later … Professor Jordan looked at me and checked something off
on his paper — <em>without</em> consulting anyone else for assistance. I held my
breath mentally, and when another student whose last name was after mine was
called, I grinned.</p>
<p>My streak of having professors know me continues! Whew!</p>
<p>That personal scenario aside, let’s get back to the final exam. Or, maybe not. I
probably can’t divulge too much about it, given that some of the material might
be repeated in future iterations of the course. Let me just say two things
regarding the exam:</p>
<ul>
<li>Ooof. Ouch. Professor Jordan wasn’t kidding when he said that the final exam
was going to be difficult. Not a single student finished early, though some
were no doubt quadruple-checking their answers, right?</li>
<li>Professor Jordan wasn’t kidding when he said that the class would be graded
leniently.</li>
</ul>
<p>I don’t know what else there is to say.</p>
<h2 id="iamdyingtoknow">I am Dying to Know</h2>
<p>Well, STAT 210B is now over, and in retrospect I am <em>really</em> happy I took the
course. Even though I know I won’t be doing research in this field, I’m glad
that I got a taste of the research frontier in high-dimensional statistics and
theoretical machine learning. I hope that understanding some of the math here
can transfer to increased comprehension of technical material more directly
relevant to my research.</p>
<p>Possibly more than anything else, STAT 210B made me really appreciate the
enormous talent and ability that Professor Michael I. Jordan and Professor
Martin Wainwright exhibit in math and statistics. I’m blown away at how fast
they can process, learn, connect, and explain technically demanding material.
And the fact that Professor Wainwright wrote the textbook solo, and that much of
the material there <em>comes straight from his own research papers</em> (often
co-authored with Professor Jordan!) surely attests to why those two men are
award-winning statistics and machine learning professors.</p>
<p>It makes me wonder: <em>what do I lack compared to them</em>? I know that throughout my
life, being deaf has put me at a handicap, which my white male privilege (even
though I’m not white) can’t completely overcome. But if Professor Jordan or
Professor Wainwright and I were to sit side-by-side and each read the latest
machine learning research paper, they would be able to process and understand
the material far faster than I could. In theory, reading a research paper
means my disability shouldn’t be a strike against me.</p>
<p>So what is it that prevents me from being like those two?</p>
<p>I tried doing as much of the lecture reading as I could, and I truly understood
a lot of the material. Unfortunately, many times I would get bogged down by some
technical item which I couldn’t wrap my head around, or I would fail to fill in
missing steps to argue why some “obvious” conclusion is true. Or I would miss
some (obvious?) mathematical trick that I needed to apply, which was one of the
motivating factors for me writing <a href="https://danieltakeshi.github.io/2017/05/06/mathematicaltrickscommonlyusedinmachinelearningandstatistics">a lengthy blog post about these mathematical
tricks</a>.</p>
<p>Then again, after one of the GSIs grinned awkwardly at me when I complained to
him during office hours about not understanding one of Professor Wainwright’s
incessant “putting together the pieces” comments without any justification
whatsoever … maybe even advanced students struggle from time to time? And
Wainwright <em>does</em> have this to say in the first chapter of his book:</p>
<blockquote>
<p>Probably the most subtle requirement is a certain degree of mathematical
maturity on the part of the reader. This book is meant for the person who is
interested in gaining a deep understanding of the core issues in
high-dimensional statistics. As with anything worthwhile in life, doing so
requires effort. This basic fact should be kept in mind while working through
the proofs, examples and exercises in the book.</p>
</blockquote>
<p>(I’m not sure a “certain degree” is a good description; more like a “very
high degree,” wouldn’t you say?)</p>
<p>Again, I am dying to know:</p>
<blockquote>
<p>What is the difference between me and Professor Jordan? For instance, when we
each read Professor Wainwright’s textbook, why is he able to process and
understand the information at a much faster rate? Does his brain simply work
on a higher plane? Do I lack his intensity, drive, and/or focus? Am I
inherently less talented?</p>
</blockquote>
<p>I just don’t know.</p>
<h2 id="randomthoughts">Random Thoughts</h2>
<p>Here are a few other random thoughts and comments I have about the course:</p>
<ul>
<li>
<p>The course had recitations, which are once-a-week events when one of the TAs
leads a class section to discuss certain class concepts in more detail.
Attendance was optional, but since the recitations conflicted with one of my
research lab meetings, I didn’t attend a single recitation. Thus, I don’t know
what they were like. However, future students taking 210B should at least
attend one section to see if such sessions would be beneficial.</p>
</li>
<li>
<p>Yes, I had sign language interpreting services, which are my usual class
accommodations. Fortunately, I had a <a href="https://danieltakeshi.github.io/20151031thebenefitsofhavingthesamegroupofinterpreters/">consistent group</a> of two
interpreters who attended almost every class. They were kind enough to
bear through such technically demanding material, and I know that one of the
interpreters was sick once, but came to work anyway since she knew that
whoever would be substituting would be scarred for life from the class
material. Thanks to both of you<sup id="fnref:i_am_told"><a href="#fn:i_am_told" class="footnote">4</a></sup>, and I hope to continue working
with you in the future!</p>
</li>
<li>
<p>To make things easier for my sign language interpreters, I showed up early to
every class to arrange two seats for them. (In fact, beyond the first few
weeks, I think I was the first student to show up to every class, since in
addition to rearranging the chairs, I used the time to review the lecture
material from Wainwright’s book.) Once the other students in the class got
used to seeing the interpreters, they didn’t touch the two magical chairs.</p>
</li>
<li>
<p>We had a class Piazza. As usual, I posted way too many times there, but it was
interesting to see that we had a lot more discussion compared to 210A.</p>
</li>
<li>
<p>The class consisted of mostly PhD students in statistics, mathematics, EECS,
and mechanical engineering, but there were a few talented undergrads who
joined the party.</p>
</li>
</ul>
<h2 id="concludingthoughts">Concluding Thoughts</h2>
<p>I’d like to get back to that Facebook discussion between Yann LeCun and Michael
I. Jordan mentioned at the beginning of this post. Professor Jordan’s final paragraph was a
pleasure to read:</p>
<blockquote>
<p>Anyway, I keep writing these overly-long posts, and I’ve got to learn to
do better. Let me just make one additional remark, which is that I’m really
proud to be a member of a research community, one that includes Yann Le Cun,
Geoff Hinton and many others, where there isn’t just lip service given to
respecting others’ opinions, but where there is real respect and real
friendship.</p>
</blockquote>
<p>I found this pleasing to read because I often find myself thinking similar
things. I too feel proud to be part of this field, even though I know I don’t
have a fraction of the contributions of those guys. I feel
privileged<sup id="fnref:privilege"><a href="#fn:privilege" class="footnote">5</a></sup> to be able to learn statistics and machine learning from
Professor Jordan and all the other professors I’ve encountered in my education.
My goal is to become a far better researcher than I am now so that I feel like I
am giving back to the community. That’s indeed one of the reasons why I started
this blog way back in August 2011 when I was hunched over my desk on the eighth
floor of a dorm at the University of Washington. I wanted a blog in part so that
I could discuss the work I’m doing and new concepts that I’ve learned, all while
making it hopefully accessible to many readers.</p>
<p>The other amusing thing that Professor Jordan and I have in common is that we
both write overly long posts, him on his Facebook, and me on my blog. It’s time
to get back to research.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:schmidhuber">
<p>The other two are Geoffrey Hinton and Yoshua Bengio. Don’t get
me started with Jürgen Schmidhuber, though he’s admittedly a clear fourth. <a href="#fnref:schmidhuber" class="reversefootnote">↩</a></p>
</li>
<li id="fn:interview">
<p>This came out of <a href="http://spectrum.ieee.org/robotics/artificialintelligence/machinelearningmaestromichaeljordanonthedelusionsofbigdataandotherhugeengineeringefforts">an interview that Professor Jordan had with IEEE
back in 2014</a>. However, it didn’t quite go as well as Professor Jordan
wanted, and he criticized the title and hype (see the featured comments
below the article). <a href="#fnref:interview" class="reversefootnote">↩</a></p>
</li>
<li id="fn:sadly">
<p>Sadly, this “unfair advantage” has not translated into “getting noticed”
in other respects, such as friendship, dating, and so forth. <a href="#fnref:sadly" class="reversefootnote">↩</a></p>
</li>
<li id="fn:i_am_told">
<p>While I don’t advertise this blog to sign language interpreters, a
few years ago one of them said that there had been “some discussion” of my
blog among her social circle of interpreters. Interesting … <a href="#fnref:i_am_told" class="reversefootnote">↩</a></p>
</li>
<li id="fn:privilege">
<p>Even though that word has gotten a bad rap from the Social Justice
Warriors, it’s the right word here. <a href="#fnref:privilege" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 26 Jun 2017 05:00:00 -0700
https://danieltakeshi.github.io/2017/06/26/reviewoftheoreticalstatisticsstat210batberkeley/
https://danieltakeshi.github.io/2017/06/26/reviewoftheoreticalstatisticsstat210batberkeley/The BAIR Blog is Now Live<p style="text-align:center;">
<img src="https://danieltakeshi.github.io/assets/BAIR_Logo_BlueType_Tag.png" width="600" />
</p>
<p><br /></p>
<p>The word should now be out that BAIR — short for Berkeley Artificial
Intelligence Research — now has a blog. The official <a href="http://bair.berkeley.edu/">BAIR website is here</a>
and the <a href="http://bair.berkeley.edu/blog/">blog is located here</a>.</p>
<p>I was part of the team which created and set up the blog. The blog was written
using Jekyll, so for the most part I was able to utilize my prior Jekyll
knowledge from working on “Seita’s Place” (that name really sounds awful,
sorry).</p>
<p>One neat thing that I learned throughout this process was how to design a Jekyll
blog but then have it appear as a <em>subdirectory</em> inside an <em>existing</em> website
like the BAIR website with the correct URLs. The key is to understand two
things:</p>
<ul>
<li>
<p>The <code class="highlighterrouge">_site</code> folder generated when you build and preview Jekyll locally
contains all you need to build the website using normal HTML. Just copy over
the contents of this folder into wherever the server is located.</p>
</li>
<li>
<p>In order to get <em>links</em> set up correctly, it is first necessary to understand
how “baseurl”s work for project pages, among other things. <a href="https://byparker.com/blog/2014/clearingupconfusionaroundbaseurl/">This blog post</a>
and <a href="http://downtothewire.io/2015/08/15/configuringjekyllforuserandprojectgithubpages/">this other blog post</a> can clarify these concepts. Assuming you have
correct <code class="highlighterrouge">site.url</code> and <code class="highlighterrouge">site.baseurl</code> variables, to build the website, you
need to run</p>
<p><code class="highlighterrouge">
JEKYLL_ENV=production bundle exec jekyll serve
</code></p>
<p>The production mode aspect will automatically configure the contents of
<code class="highlighterrouge">_site</code> to contain the correct links. This is extremely handy — otherwise,
there would be a bunch of annoying <code class="highlighterrouge">http://localhost:4000</code> strings and we’d
have to run cumbersome find-and-replace commands. The contents of this folder
can then be copied over to where the server is located.</p>
</li>
</ul>
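<p>For reference, those two variables live in the blog’s <code class="highlighterrouge">_config.yml</code>. A minimal sketch for a blog served as a subdirectory of an existing site might look like the following (the values are illustrative, based on the BAIR URLs above):</p>

```yaml
# _config.yml (illustrative values)
url: "http://bair.berkeley.edu"
baseurl: "/blog"
```

<p>With these set, links built from <code class="highlighterrouge">site.baseurl</code> resolve correctly both during local preview and on the server.</p>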
<p>Anyway, enough about that. Please check out our inaugural blog post, about an
exciting concept called <em>Neural Module Networks</em>.</p>
Tue, 20 Jun 2017 02:00:00 0700
https://danieltakeshi.github.io/2017/06/20/thebairblogisnowlive/
https://danieltakeshi.github.io/2017/06/20/thebairblogisnowlive/OpenAI's Generative Adversarial Imitation Learning Code<p>In an <a href="https://danieltakeshi.github.io/2017/05/30/awspackerandopenaisevolutionstrategiescode">earlier blog post</a>, I described how to use OpenAI’s Evolution
Strategies code. In this post, I’ll provide a similar guide for their <a href="https://github.com/openai/imitation">imitation
learning code</a> which corresponds to the NIPS 2016 paper <em>Generative
Adversarial Imitation Learning</em>. While the code works and is quite robust (as
I’ll touch upon later), there’s little documentation and on the GitHub issues
page, people have asked variants of “please help me run the code!!” Thus, I
thought I’d provide some insight into how the code works. Just like the ES code,
it runs on a cluster, but I’ll specifically run it on a <em>single</em> machine to make
life easier.</p>
<p>The code was written in early 2016, so it uses Theano instead of TensorFlow. The
first task for me was therefore to install Theano on my Ubuntu 16.04 machine
with a TITAN X GPU. The imitation code is for Python 2.7, so I also decided to
install Anaconda. If I want to switch back to Python 3.5, then I think I can
modify my <code class="highlighterrouge">.bashrc</code> file to comment out the references to Anaconda, but maybe
it’s better for me to use virtual environments. I don’t know.</p>
<p>I then followed the installation instructions to get the stable 0.9.0 version of Theano. My
configuration looks like this:</p>
<div class="highlighterrouge"><pre class="highlight"><code>[global]
floatX = float64
device = gpu
[cuda]
root = /usr/local/cuda-8.0
</code></pre>
</div>
<p>Unfortunately, I ran into some nightmares with installing Theano. I hope you’re
not interested in the details; I <a href="https://groups.google.com/forum/#!topic/theanousers/_J7BxmP8DqA">wrote them here</a> on their Google Groups.
Let’s just say that their new “GPU backend” causes me more trouble than it’s
worth, which is why I kept the old <code class="highlighterrouge">device = gpu</code> setting. Theano still seems to
complain and spews out warnings about the <code class="highlighterrouge">float64</code> setting I have here, but I
don’t have much of a choice since the imitation code assumes double precision
floats.</p>
<p>Yeah, I’m definitely switching back to TensorFlow as soon as possible.</p>
<p>Back to the code — how does one run it? By calling <code class="highlighterrouge">scripts/im_pipeline.py</code>
three times, as follows:</p>
<div class="highlighterrouge"><pre class="highlight"><code>python scripts/im_pipeline.py pipelines/im_classic_pipeline.yaml 0_sampletrajs
python scripts/im_pipeline.py pipelines/im_classic_pipeline.yaml 1_train
python scripts/im_pipeline.py pipelines/im_classic_pipeline.yaml 2_eval
</code></pre>
</div>
<p>where the pipeline configuration file can be one of the four provided options
(or something that you provide). You can put these three commands in a bash
script so that they automatically execute sequentially.</p>
<p>If you run the commands one-by-one from the imitation repository, you should
notice that the first one succeeds after a small change: get rid of the
<code class="highlighterrouge">Acrobot-v0</code> task. That version no longer exists in OpenAI gym. You could train
version 1 using their TRPO code, but I opted to skip it for simplicity.</p>
<p>That first command generates expert trajectories to use as input data for
imitation learning. The second command is the heavy-duty part of the code: the
actual imitation learning. It also needs some modification to get it to work for
a sequential setting, because the code compiles a list of commands to execute in
a cluster.</p>
<p>Those commands are all of the form <code class="highlighterrouge">python script_name.py [arg1] [arg2] ...</code>. I
decided to put them together in a list and then run them sequentially, which can
easily be done using this code snippet:</p>
<figure class="highlight"><pre><code class="languagepython" datalang="python"><span class="n">all_commands</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="o">**</span><span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">cmd_templates</span><span class="p">,</span><span class="n">argdicts</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">command</span> <span class="ow">in</span> <span class="n">all_commands</span><span class="p">:</span>
<span class="n">subprocess</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="n">command</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">))</span></code></pre></figure>
<p>This is nifty: the <code class="highlighterrouge">x.format(**y)</code> part looks odd, but <code class="highlighterrouge">x</code> is a format string in
Python whose named placeholders are filled in by the values of <code class="highlighterrouge">y</code>.</p>
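<p>To make this concrete, here is a toy illustration (the template and argument dictionary below are made up, not taken from the actual code):</p>

```python
# Each template's named placeholders are filled in from the matching dictionary.
cmd_templates = ["python train.py --env {env} --seed {seed}"]
argdicts = [{"env": "CartPole-v0", "seed": 0}]

all_commands = [x.format(**y) for (x, y) in zip(cmd_templates, argdicts)]
# all_commands[0] == "python train.py --env CartPole-v0 --seed 0"
```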
<p>If running something like the above doesn’t quite work, you might want to check
the following:</p>
<ul>
<li>
<p>If you’re getting an error with pytables, it’s probably because you’re using
version 3.x of the library, which changed <code class="highlighterrouge">getNode</code> to <code class="highlighterrouge">get_node</code>. Someone
<a href="https://github.com/openai/imitation/pull/7">wrote a pull request for this</a> which should probably get integrated ASAP.
(Incidentally, pytables looks like a nice library for data management, and I
should probably consider using it in the near future.)</p>
</li>
<li>
<p>If you’re re-running the code, you need to delete the appropriate output
directories. It can be annoying, but don’t remove this functionality! It’s too
easy to accidentally run a script that overwrites your old data files. Just
manually delete them; it’s better.</p>
</li>
<li>
<p>If you get a lot of “Exception ignored” messages, go into
<code class="highlighterrouge">environments/rlgymenv.py</code> and comment out the <code class="highlighterrouge">__del__</code> method in the
<code class="highlighterrouge">RLGymSim</code> class. I’m not sure why that’s there. Perhaps it’s useful in
clusters to save memory? Removing the method didn’t seem to adversely impact
my code and it got rid of the warning messages, so I’m happy.</p>
</li>
<li>
<p>Someone else mentioned in <a href="https://github.com/openai/imitation/issues/3">this GitHub issue</a> that he had to disable
multithreading, but fortunately I didn’t seem to have this problem.</p>
</li>
</ul>
<p>Hopefully, if all goes well, you’ll see a long list of compressed files
containing relevant data for the runs. Here’s a snippet of the first few that I
see, assuming I used <code class="highlighterrouge">im_classic_pipeline.yaml</code>:</p>
<div class="highlighterrouge"><pre class="highlight"><code>alg=bclone,task=cartpole,num_trajs=10,run=0.h5
alg=bclone,task=cartpole,num_trajs=10,run=1.h5
alg=bclone,task=cartpole,num_trajs=10,run=2.h5
alg=bclone,task=cartpole,num_trajs=10,run=3.h5
alg=bclone,task=cartpole,num_trajs=10,run=4.h5
alg=bclone,task=cartpole,num_trajs=10,run=5.h5
alg=bclone,task=cartpole,num_trajs=10,run=6.h5
alg=bclone,task=cartpole,num_trajs=1,run=0.h5
alg=bclone,task=cartpole,num_trajs=1,run=1.h5
alg=bclone,task=cartpole,num_trajs=1,run=2.h5
alg=bclone,task=cartpole,num_trajs=1,run=3.h5
alg=bclone,task=cartpole,num_trajs=1,run=4.h5
alg=bclone,task=cartpole,num_trajs=1,run=5.h5
alg=bclone,task=cartpole,num_trajs=1,run=6.h5
</code></pre>
</div>
<p>The algorithm here is behavioral cloning, one of the four that the GAIL paper
benchmarked. The number of trajectories is 10 for the first seven files, then 1
for the others. These represent the “dataset size” quantities in the paper, so
the next set of files appearing after this would have 4 and then 7. Finally,
each dataset size is run seven times from seven different initializations, as
explained in the very last sentence in the appendix of the GAIL paper:</p>
<blockquote>
<p>For the cartpole, mountain car, acrobot, and reacher, these statistics are
further computed over 7 policies learned from random initializations.</p>
</blockquote>
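<p>Incidentally, since the run parameters are encoded directly in those filenames, they are easy to recover with a few lines of Python. This is a hypothetical helper, not part of the imitation repository:</p>

```python
def parse_run_name(fname):
    """Turn 'alg=bclone,task=cartpole,num_trajs=10,run=0.h5' into a dict."""
    base = fname[:-len(".h5")] if fname.endswith(".h5") else fname
    return dict(kv.split("=", 1) for kv in base.split(","))

info = parse_run_name("alg=bclone,task=cartpole,num_trajs=10,run=0.h5")
# info == {'alg': 'bclone', 'task': 'cartpole', 'num_trajs': '10', 'run': '0'}
```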
<p>The third command is the evaluation portion, which takes the log files and
compresses them all into a single <code class="highlighterrouge">results.h5</code> file (or whatever you called it in
your <code class="highlighterrouge">.yaml</code> configuration file). I kept the code exactly the same as it was in
the original version, but note that you’ll need to have <em>all</em> the relevant
output files as specified in the configuration or else you’ll get errors.</p>
<p>When you run the evaluation portion, you should see, for each policy instance,
its mean and standard deviation over 50 rollouts. For instance, with behavioral
cloning, the policy that’s chosen is the one that performed best on the
validation set. For the others, it’s whatever appeared at the final iteration of
the algorithm.</p>
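<p>For intuition, each reported pair of numbers is just the sample mean and standard deviation of the rollout returns. A minimal sketch with made-up returns (the real code uses 50 rollouts):</p>

```python
import numpy as np

# Hypothetical per-rollout returns for one policy.
returns = np.array([200.0, 195.0, 200.0, 188.0, 200.0])
mean, std = returns.mean(), returns.std()
# mean == 196.6
```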
<p>The last step is to arrange these results and plot them somehow. Unfortunately,
while you can get an informative plot using <code class="highlighterrouge">scripts/showlog.py</code>, I don’t think
there’s code in the repository to generate Figure 1 in the GAIL paper, so I
wrote some plotting code from scratch. For CartPole-v0 and MountainCar, I got
the following results:</p>
<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/gail_results.png" /> </p>
<p>These are comparable with what’s in the paper, though I find it interesting that
GAIL seems to choke with the size 7 and 10 datasets for CartPole-v0. Hopefully
this is within the random noise. I’ll test with the harder environments shortly.</p>
<p><strong>Acknowledgments</strong>: I thank Jonathan Ho for releasing this code. I know it
seems like sometimes I (or other users) complain about lack of documentation,
but it’s still quite rare to see clean, functional code to exactly reproduce
results in research papers. The code base is robust and highly generalizable to
various settings. I also learned some new Python concepts from reading his code.
Jonathan Ho must be an all-star programmer.</p>
<p><strong>Next Steps</strong>: If you’re interested in running the GAIL code sequentially,
consider looking at <a href="https://github.com/DanielTakeshi/imitation">my fork here</a>. I’ve also added considerable
documentation.</p>
Thu, 15 Jun 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/06/15/openaisgenerativeadversarialimitationlearningcode/
https://danieltakeshi.github.io/2017/06/15/openaisgenerativeadversarialimitationlearningcode/AWS, Packer, and OpenAI's Evolution Strategies Code<p>I have very little experience with programming in clusters, so when OpenAI
released their <a href="https://github.com/openai/evolutionstrategiesstarter">evolution strategies starter code</a> which runs only on EC2
instances, I took this opportunity to finally learn how to program in clusters
the way professionals do it.</p>
<h2 id="amazonwebservices">Amazon Web Services</h2>
<p>The first task is to get an Amazon Web Services (AWS) account. AWS offers a
mind-bogglingly large amount of resources for doing all sorts of cloud
computing. For our purposes, the most important feature is the Elastic Compute
Cloud (EC2). The short description of these guys is that they allow me to run
code on heavily-customized machines that I don’t own. The only catch is that
running code this way costs some money commensurate with usage, so watch out.</p>
<p>Note that joining AWS means we start off with one year of the free-tier option.
This isn’t as good as it sounds, though, since many machines (e.g. those with
GPUs) are not eligible for free tier usage. You still have to watch your
budget.</p>
<p>One immediate aspect of AWS to understand is their security credentials. They
state (emphasis mine):</p>
<blockquote>
<p>You use different types of security credentials depending on how you interact
with AWS. For example, you use a user name and password to sign in to the AWS
Management Console. <strong>You use access keys to make programmatic calls to AWS
API actions.</strong></p>
</blockquote>
<p>To use the OpenAI code, I have to provide my AWS access key and <em>secret</em> access
keys, which are officially designated as <code class="highlighterrouge">AWS_ACCESS_KEY_ID</code> and
<code class="highlighterrouge">AWS_SECRET_ACCESS_KEY</code>, respectively. These aren’t initialized by default; we
have to explicitly create them. This means going to the Security Credentials
tab, and seeing:</p>
<p style="text-align:center;"> <img src="https://danieltakeshi.github.io/assets/aws_console_01.png" /> </p>
<p>You can create root access and secret access keys this way, but this is <em>not</em>
the recommended way. To be clear, I took the above screenshot from the “root”
perspective, so make sure you’re not seeing this on your computer. AWS
<em>strongly recommends</em> instead making a new user with administrative
privileges, which effectively means it’s as good as the root account (minus
the ability to view billing information). You can see <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/gettingstarted_createadmingroup.html">their official
instructions here</a> to create groups with administrative privileges. The way I
think of it, I’m a systems administrator and have to create a bunch of users for
a computer. Except here, I only need to create one. So maybe this is a bit
unnecessary, but I think it’s helpful to get used to the good practices as soon
as possible. <a href="https://alestic.com/2014/09/awsrootpassword/">This author</a> even suggests throwing away (!!) the root AWS
password.</p>
<p>After following those instructions I had a “new” user and created the two access
keys. These must be manually downloaded, where they’ll appear in a <code class="highlighterrouge">.csv</code> file.
Don’t lose them!</p>
<p>Next, we have to <em>provide</em> these credentials. When running packer code, as I’ll
show in the next section, it suffices to either provide them as command line
arguments, or use more secure ways such as adding them to your <code class="highlighterrouge">.bashrc</code> file. I
chose the latter. <a href="http://docs.aws.amazon.com/sdkforjava/v1/developerguide/credentials.html">This page from AWS</a> provides further information about how
to provide your credentials, and the packer documentation <a href="https://www.packer.io/docs/builders/amazon.html#specifyingamazoncredentials">contains similar
instructions</a>.</p>
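<p>For the <code class="highlighterrouge">.bashrc</code> route, this boils down to exporting the two variables. The values below are AWS’s documented example keys, not real ones; substitute the keys from your downloaded <code class="highlighterrouge">.csv</code> file:</p>

```shell
# Append to ~/.bashrc, then run: source ~/.bashrc
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```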
<p>On a final note regarding AWS, I had a hard time figuring out how to actually
<em>log in</em> as the Administrator user, rather than the root. <a href="https://stackoverflow.com/questions/21834879/howtologintoawsconsolewithaniamuseraccount">This StackOverflow
question</a> really helped out, but I’m baffled as to why this isn’t easier to
do.</p>
<h2 id="installingandunderstandingpacker">Installing and Understanding Packer</h2>
<p>As stated in the OpenAI code, we must use something known as packer to run the
code. After installing it, I went through <a href="https://www.packer.io/intro/gettingstarted/buildimage.html">their basic example</a>. Notice that
in their <code class="highlighterrouge">.json</code> file, they have the following:</p>
<div class="languagejson highlighterrouge"><pre class="highlight"><code><span class="s2">"variables"</span><span class="err">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nt">"aws_access_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
</span><span class="nt">"aws_secret_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="p">}</span><span class="err">,</span><span class="w">
</span></code></pre>
</div>
<p>where the access and secret keys must be supplied in some way. They <em>could</em> be
hardcoded above if you want to type them in there, but as mentioned earlier, I
chose to use environment variables in <code class="highlighterrouge">.bashrc</code>.</p>
<p>Here are a couple of things to keep in mind when running packer’s basic example:</p>
<ul>
<li>
<p>Be patient when the <code class="highlighterrouge">packer build</code> command is run. It does not officially
conclude until one sees:</p>
<div class="highlighterrouge"><pre class="highlight"><code>==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
us-east-1: ami-19601070
</code></pre>
</div>
<p>where the last line will certainly be different if you run it.</p>
</li>
<li>
<p>The output, at least in this case, is an Amazon Machine Image (AMI) that I
own. Therefore, I will have to start paying a small fee if this image remains
active. There are two steps to deactivating this and ensuring that I don’t
have to pay: “deregistering” the image and <em>deleting</em> the (associated)
snapshot. For the former, go to the EC2 Management Console and see the <code class="highlighterrouge">IMAGES
/ AMIs</code> dropdown menu, and for the latter, use <code class="highlighterrouge">ELASTIC BLOCK STORE /
Snapshots</code>. From my experience, deregistering can take several minutes, so
just be patient. These have to happen in order, as deleting the snapshot first
will result in an error which says that the image is still using it.</p>
</li>
<li>
<p>When launching (or even when deactivating, for that matter) be careful about
the location you’re using. Look at the upper right corner for the locations.
The “us-east-1” region is “Northern Virginia” and that is where the image and
snapshot will be displayed. If you change locations, you won’t see them.</p>
</li>
<li>
<p>Don’t change the “region” argument in the “builders” list; it has to stay at
“us-east-1”. When I first fired this up and saw that my image and snapshot
were in “us-east-1” instead of the more-desirable “us-west-1” (Northern
California) for me, I tried changing that argument and rebuilding. But then I
got an error saying that the image couldn’t be found.</p>
<p>I <em>think</em> what happens is that the provided “source_ami” argument is the
packer author’s fixed, base machine that he set up for the purposes of this
tutorial, with packer installed (and maybe some other stuff). Then the
<code class="highlighterrouge">.json</code> file we have copies that image, as suggested by this statement in the
docs (emphasis mine):</p>
<blockquote>
<p>Congratulations! You’ve just built your first image with Packer. Although
the image was pretty useless in this case (<strong>nothing was changed about
it</strong>), this page should’ve given you a general idea of how Packer works,
what templates are and how to validate and build templates into machine
images.</p>
</blockquote>
</li>
</ul>
<p>In packer’s <a href="https://www.packer.io/intro/gettingstarted/provision.html">slightly more advanced example</a>, we get to see what happens
when we want to <em>pre-install</em> some software on our machines, and it’s here where
we see packer’s benefits start to truly shine. In that new example, the
“provisions” list lets us run command line arguments to install desired packages
(i.e. <code class="highlighterrouge">sudo aptget install blahblahblah</code>). When I sshed into the generated
machine — a bit of a struggle at first since I didn’t realize the username to
get in was actually ubuntu instead of ec2user — I could successfully run
<code class="highlighterrouge">redisserver</code> on the command line and it was clear that the package had been
installed.</p>
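<p>For concreteness, here is a minimal sketch of what such a “provisioners” entry can look like, mirroring Packer’s getting-started example (the “builders” section is omitted, and the package name is just the one from that tutorial):</p>

```json
{
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "sleep 30",
        "sudo apt-get update",
        "sudo apt-get install -y redis-server"
      ]
    }
  ]
}
```

<p>The <code class="highlighterrouge">sleep 30</code> gives the machine a moment to finish booting before package installation begins.</p>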
<p>In OpenAI’s code, they have a full <em>script</em> of commands which they load in.
Thus, any image that we create from the packer build will have those commands
run, so that our machines will have exactly the kind of software we want. In
particular, OpenAI’s script installs TensorFlow, gym, the ALE, and so on. If we
didn’t have packer, I think we would have to manually execute that script for
all the machines. To give a sense of how slow that would be, the OpenAI ES paper
said they once tested with <em>1,440</em> machines.</p>
<h2 id="openaiscode">OpenAI’s Code</h2>
<p>The final stage is to understand how to run OpenAI’s code. As mentioned earlier,
there’s a <code class="highlighterrouge">dependency.sh</code> shell script which will install stuff on our
cloud-computing machines. Unfortunately, MuJoCo is not open source.
(Fortunately, we might have an alternative with <a href="https://blog.openai.com/roboschool/">OpenAI’s RoboSchool</a> — I
hope to see that work out!) Thus, we have to add our own license. For me, this
was a two-stage process.</p>
<p>First, in the configuration file, I added the following two <em>file provisioners</em>:</p>
<div class="languagejson highlighterrouge"><pre class="highlight"><code><span class="s2">"provisioners"</span><span class="err">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"file"</span><span class="p">,</span><span class="w">
</span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/home/daniel/mjpro131"</span><span class="p">,</span><span class="w">
</span><span class="nt">"destination"</span><span class="p">:</span><span class="w"> </span><span class="s2">"~/"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"file"</span><span class="p">,</span><span class="w">
</span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/home/daniel/mjpro131/mjkey.txt"</span><span class="p">,</span><span class="w">
</span><span class="nt">"destination"</span><span class="p">:</span><span class="w"> </span><span class="s2">"~/"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"shell"</span><span class="p">,</span><span class="w">
</span><span class="nt">"scripts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"dependency.sh"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre>
</div>
<p>In packer, the elements in the “provisioners” array are executed in order of
their appearance, so I wanted the files sent over to the home directory on the
images so that they’d be there for the shell script later. The “source” strings
are where MuJoCo is stored on my personal machine, the one which executes
<code class="highlighterrouge">packer build packer.json</code>.</p>
<p>Next, inside <code class="highlighterrouge">dependency.sh</code>, I simply added the following two <code class="highlighterrouge">sudo mv</code>
commands:</p>
<div class="highlighterrouge"><pre class="highlight"><code>#######################################################
# WRITE CODE HERE TO PLACE MUJOCO 1.31 in /opt/mujoco #
# The key file should be in /opt/mujoco/mjkey.txt #
# Mujoco should be installed in /opt/mujoco/mjpro131 #
#######################################################
sudo mv ~/mjkey.txt /opt/mujoco/
sudo mv ~/mjpro131 /opt/mujoco/
</code></pre>
</div>
<p>(Yes, we’re still using MuJoCo 1.31. I’m not sure why the upgraded versions
don’t work.)</p>
<p>This way, when running <code class="highlighterrouge">packer build packer.json</code>, the relevant portion of the
output should look something like this:</p>
<div class="languageshell highlighterrouge"><pre class="highlight"><code>amazon-ebs: + sudo mkdir -p /opt/mujoco
amazon-ebs: + sudo mv /home/ubuntu/mjkey.txt /opt/mujoco/
amazon-ebs: + sudo mv /home/ubuntu/mjpro131 /opt/mujoco/
amazon-ebs: + sudo tee /etc/profile.d/mujoco.sh
amazon-ebs: + sudo <span class="nb">echo</span> <span class="s1">'export MUJOCO_PY_MJKEY_PATH=/opt/mujoco/mjkey.txt'</span>
amazon-ebs: + sudo tee -a /etc/profile.d/mujoco.sh
amazon-ebs: + sudo <span class="nb">echo</span> <span class="s1">'export MUJOCO_PY_MJPRO_PATH=/opt/mujoco/mjpro131'</span>
amazon-ebs: + . /etc/profile.d/mujoco.sh
</code></pre>
</div>
<p>where the <code class="highlighterrouge">sudo mv</code> commands have successfully moved my MuJoCo materials over to
the desired target directory.</p>
<p>As an aside, I should also mention the other change I made to <code class="highlighterrouge">packer.json</code>: in
the “ami_regions” argument, I deleted all regions except for “us-west-1”, since
otherwise images would be created in <em>all</em> the regions listed.</p>
<p>Running <code class="highlighterrouge">packer build packer.json</code> takes about thirty minutes to run. Upon
concluding, I saw the following output:</p>
<div class="highlighterrouge"><pre class="highlight"><code>==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
us-west-1: ami-XXXXXXXX
</code></pre>
</div>
<p>where for security reasons, I have not revealed the full ID. Then, inside
<code class="highlighterrouge">launch.py</code>, I put in:</p>
<figure class="highlight"><pre><code class="languagepython" datalang="python"><span class="c"># This will show up under "My AMIs" in the EC2 console.</span>
<span class="n">AMI_MAP</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"us-west-1"</span><span class="p">:</span> <span class="s">"ami-XXXXXXXX"</span>
<span class="p">}</span> </code></pre></figure>
<p>The last step is to call the launcher script with the appropriate arguments.
Before doing so, <em>make sure you’re using Python 3</em>. I originally ran this with
Python 2.7 and was getting some errors. (Yeah, yeah, I still haven’t changed
even though I said I would do so four years ago; blame backwards
incompatibility.) One easy way to manage different Python versions on one
machine is to use Python virtual environments. I started a new one with Python
3.5 and was able to get going after a few <code class="highlighterrouge">pip install</code> commands.</p>
<p>You can find the necessary arguments in the <code class="highlighterrouge">main</code> method of <code class="highlighterrouge">launch.py</code>. To
understand these arguments, it can be helpful to look at the <code class="highlighterrouge">boto3</code>
documentation, which is the Python library that interfaces with AWS. In
particular, reading the <code class="highlighterrouge">create_instances</code> <a href="http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.Subnet.create_instances">documentation</a> will be useful.</p>
<p>I ended up using:</p>
<div class="highlighterrouge"><pre class="highlight"><code>python launch.py ../configurations/humanoid.json \
key_name="MyKeyPair" \
s3_bucket="s3://putnamehere" \
region_name="us-west-1" \
zone="us-west-1b" \
master_instance_type="m4.large" \
worker_instance_type="t2.micro" \
security_group="default" \
spot_price="0.05"
</code></pre>
</div>
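<p>To make the argument format concrete, here is a small sketch (my own illustration, <em>not</em> OpenAI’s actual parsing code) of how such <code class="highlighterrouge">key=value</code> overrides can be split into a dictionary before being handed to the launcher logic:</p>

```python
# Sketch of parsing launcher-style key=value overrides.
# Illustrative only; launch.py's real argument handling may differ.

def parse_overrides(args):
    """Split 'key=value' strings into a dict of configuration overrides."""
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % arg)
        overrides[key] = value
    return overrides

# Example mirroring the command-line call above:
args = ["key_name=MyKeyPair", "region_name=us-west-1", "spot_price=0.05"]
print(parse_overrides(args))
```

<p>Each override then maps naturally onto a keyword argument for the underlying <code class="highlighterrouge">boto3</code> calls.</p>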
<p>A few pointers:</p>
<ul>
<li>Make sure you run <code class="highlighterrouge">sudo apt install awscli</code> if you don’t have the package
already installed.</li>
<li>Double check the default arguments for the two access keys. They’re slightly
different than what I used in the packer example, so I adjusted my <code class="highlighterrouge">.bashrc</code>
file.</li>
<li>“MyKeyPair” comes from the <code class="highlighterrouge">MyKeyPair.pem</code> file which I created via the EC2
console.</li>
<li>The <code class="highlighterrouge">s3_bucket</code> argument is based on <a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html">AWS Simple Storage Service</a>. I made
my own unique bucket name via the S3 console, and to actually provide it as an
argument, write it as <code class="highlighterrouge">s3://putnamehere</code> where <code class="highlighterrouge">putnamehere</code> is what you
created.</li>
<li>The <code class="highlighterrouge">region_name</code> should be straightforward. The <code class="highlighterrouge">zone</code> argument is similar,
except we add letters at the end since they can be thought of as “subsets” of
the regions. Not all zones will be available to you, since AWS adjusts what
you can use so that it can more effectively achieve load balancing for its
entire service.</li>
<li>The <code class="highlighterrouge">master_instance_type</code> and <code class="highlighterrouge">worker_instance_type</code> arguments are the names
of the instance types; <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancetypes.html">see this for more information</a>. It turns out that
the master requires a more advanced (and thus more expensive) type due to <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html">EBS
optimization</a>. I chose t2.micro for the workers, which seems to work and is
better for me since that’s the only type eligible for the free tier.</li>
<li>The <code class="highlighterrouge">security_group</code>s you have can be found in the EC2 console under <code class="highlighterrouge">NETWORK
& SECURITY / Security Groups</code>. Make sure you use the <em>name</em>, not the <em>ID</em>; the
names are NOT the strings that look like “sg-XYZXYZXYZ”. Watch out!</li>
<li>
<p>Finally, the <code class="highlighterrouge">spot_price</code> indicates the maximum amount to bid, since we’re
using “Spot Instances” rather than “On Demand” pricing. OpenAI’s README says:</p>
<blockquote>
<p>It’s resilient to worker termination, so it’s safe to run the workers on
spot instances.</p>
</blockquote>
<p>The README says that because spot instances can be terminated if we are
outbid.</p>
</li>
</ul>
<p>By the way, to be clear on what I mean when I talk about the “EC2 Console” and
“S3 Console”, here’s the <em>general</em> AWS console:</p>
<p style="textalign:center;"> <img src="https://danieltakeshi.github.io/assets/aws_console_02.png" /> </p>
<p>The desired consoles can be accessed by clicking “EC2” or “S3” in the above.</p>
<p>If all goes well, you should see a message like this:</p>
<div class="highlighterrouge"><pre class="highlight"><code>Scaling group created
humanoid_20170530133848 launched successfully.
Manage at [Link Removed]
</code></pre>
</div>
<p>Copy and paste the link in your browser, and you will see your instance there,
running OpenAI’s code.</p>
Tue, 30 May 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/05/30/awspackerandopenaisevolutionstrategiescode
https://danieltakeshi.github.io/2017/05/30/awspackerandopenaisevolutionstrategiescodeDeep Reinforcement Learning (CS 294-112) at Berkeley, Take Two<p>Back in Fall 2015, I took the first edition of <em>Deep Reinforcement Learning</em> (CS
294-112) at Berkeley. As usual, I <a href="https://danieltakeshi.github.io/20151217reviewofdeepreinforcementlearningcs294112atberkeley/">wrote a blog post</a> about the class; you
can find more about other classes I’ve taken by <a href="https://danieltakeshi.github.io/archive.html">searching the archives</a>.</p>
<p>In that blog post, I admitted that CS 294-112 had several weaknesses, and also
that I didn’t quite fully understand the material. Fast forward to today, and
I’m pleased to say that:</p>
<ul>
<li>
<p>There has been a second edition of CS 294-112, taught this past spring
semester. It was a three-credit, full-semester course and therefore more
substantive than the previous edition, which was two credits and lasted only
eight weeks. Furthermore, the slides, homework assignments, <em>and</em> the lecture
recordings are all publicly available online. Check out <a href="http://rll.berkeley.edu/deeprlcourse/">the course
website</a> for details. You can find the homework assignments <a href="https://github.com/berkeleydeeprlcourse/homework">in this GitHub
repository</a> (I had to search a bit for this).</p>
</li>
<li>
<p>I now understand much more about deep reinforcement learning and about how to
use TensorFlow.</p>
</li>
</ul>
<p>These developments go hand in hand, because I spent much of the second half of
the Spring 2017 semester self-studying the second edition of CS 294-112. (To be
clear, I was not enrolled in the class.) I know I said I would first self-study
a few other courses <a href="https://danieltakeshi.github.io/20160220thefourclassesthatihaveselfstudied/">in a previous blog post</a>, but I couldn’t pass up such a
prime opportunity to learn about deep reinforcement learning. Furthermore, the
field moves so fast that I worried that if I didn’t follow what was happening
<em>now</em>, I would <em>never</em> be able to catch up to the research frontier if I tried
to do so in a year.</p>
<p>The class had four homework assignments, and I completed all of them with the
exception of skipping the DAgger algorithm implementation in the first homework.
The assignments were extremely helpful for me to understand how to better use
TensorFlow, and I finally feel comfortable using it for my personal projects.
If I can spare the time (famous last words) I plan to write some
TensorFlow-related blog posts.</p>
<p>The video lectures were a nice bonus. I only watched a fraction of them, though.
This was in part due to time constraints, but also in part due to the lack of
captions. The lecture recordings are on YouTube, and in YouTube, I can turn on
automatic captions which helps me to follow the material. However, some of the
videos didn’t enable that option, so I had to skip those and just read the
slides since I wasn’t following what was being said. As far as I remember,
automatic captions are provided as an option so long as whoever uploaded the
video enables some setting, so maybe someone forgot to do so? Fortunately, the
lecture video on policy gradients has captions enabled, so I was able to watch
that one. Oh, and <a href="https://danieltakeshi.github.io/2017/03/28/goingdeeperintoreinforcementlearningfundamentalsofpolicygradients/">I wrote a blog post about the material</a>.</p>
<p>Another possible downside to the course, though this one is extremely minor, is
that the last few class sessions were <em>not</em> recorded, since those were when
students presented their final projects. Maybe the students wanted some level of
privacy? Oh well, I suppose there are way too many other interesting projects
available anyway (by searching GitHub, arXiv preprints, etc.) to worry about
this thing.</p>
<p>I want to conclude with a huge thank you to the course staff. Thank you for
helping to spread knowledge about deep reinforcement learning with a great class
and with lots of publicly available material. I really appreciate it.</p>
Wed, 24 May 2017 13:00:00 -0700
https://danieltakeshi.github.io/2017/05/24/deepreinforcementlearningcs294112atberkeleytaketwo
https://danieltakeshi.github.io/2017/05/24/deepreinforcementlearningcs294112atberkeleytaketwoAlan Turing: The Enigma<p>I finished reading Andrew Hodges’ book <em>Alan Turing: The Enigma</em>, otherwise
known as the definitive biography of mathematician, computer scientist, and code
breaker Alan Turing. I was inspired to read the book in part because I’ve been
reading lots of AI-related books this year<sup id="fnref:reading_list"><a href="#fn:reading_list" class="footnote">1</a></sup> and in just about
every one of those books, Alan Turing is mentioned in some form. In addition, I
saw the film <em>The Imitation Game</em>, and indeed this is the book that inspired it.
I bought the 2014 edition of the book — with <em>The Imitation Game</em> cover —
during a recent visit to the <a href="https://www.nsa.gov/about/cryptologicheritage/museum/">National Cryptology Museum</a>.</p>
<p>The author is Andrew Hodges, who at that time was a mathematics instructor at
the University of Oxford (he’s now retired). He maintains a website where he
commemorates Alan Turing’s life and achievements. I encourage the interested
reader to <a href="http://www.turing.org.uk/index.html">check it out</a>. Hodges has the qualifications to write the
book, being deeply versed in mathematics. He also appears to be gay
himself.<sup id="fnref:just_saying"><a href="#fn:just_saying" class="footnote">2</a></sup></p>
<p>After reading the book, my immediate thoughts relating to the <em>positive</em> aspects
of the book are:</p>
<ul>
<li>
<p>The book is organized chronologically and the eight chapters are indicated
with date ranges. Thus, for a biography of this size, it is relatively
straightforward to piece together a mental timeline of Alan Turing’s life.</p>
</li>
<li>
<p>The book is <em>detailed</em>. Like, <em>wow</em>. The edition I have is 680 pages, not
counting the endnotes at the back of the book which command an extra 30 or so
pages. Since I read almost every word of this book (I skipped a few endnotes),
and because I tried to stay alert when reading this book, I felt like I got a
clear picture of Turing’s life, along with what life must have been like
during the World War II era.</p>
</li>
<li>
<p>The book contains quotes and writings from Turing that show just how far ahead
of his time he was. For instance, even today people are still utilizing
concepts from his famous 1936 paper <em>On Computable Numbers, with an
Application to the Entscheidungsproblem</em> and his 1950 paper <em>Computing
Machinery and Intelligence</em>. The former introduced Turing Machines, the latter
introduced the famous <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>. Fortunately, I don’t think there was
much exaggeration of Turing’s accomplishments, unlike the <em>The Imitation
Game</em>. When I was reading his quotes, I often had to remind myself that “this
is the 1940s or 1950s ….”</p>
</li>
<li>
<p>The book showcases the struggles of being gay, particularly during a time when
homosexual activity was a crime. The book actually doesn’t seem to cover some
of his struggles in the early 1950s as much as I thought it would, but it
was probably difficult to find sufficient references for this aspect of his
life. At the very least, readers today should appreciate how much our attitude
towards homosexuality has improved.</p>
</li>
</ul>
<p>That’s not to say there weren’t a few downsides. Here are some I thought of:</p>
<ul>
<li>
<p>Related to what I mentioned earlier, it is <em>long</em>. It took me a month to
finish, and the writing is in “1983 style,” which makes it more difficult for
me to understand. (By contrast, I read <em>both</em> of Richard Dawkins’ recent
autobiographies, which combine to be roughly the same length as Hodges’ book,
and Dawkins’ books were much easier to read.) Now, I find Turing’s life very
interesting so this is more of a “neutral” factor to me, but I can see why the
casual reader might be dissuaded from reading this book.</p>
</li>
<li>
<p>Much of the material is technical even to me. I understand the basics of
Turing Machines but certainly not how the early computers were built. The
hardest parts of the book to read are probably in chapters six and seven (out
of eight total). I kept asking myself, “what’s a cathode ray?”</p>
</li>
</ul>
<p>To conclude, the book is an extremely detailed overview of Turing’s life which
at times may be technically challenging to read.</p>
<p>I wonder what Alan Turing would think about AI today. The widelyused AI
undergraduate textbook by Stuart Russell and Peter Norvig concludes with the
following prescient quote by Turing:</p>
<blockquote>
<p>We can only see a short distance ahead, but we can see plenty there that needs
to be done.</p>
</blockquote>
<p>Earlier scientists have an advantage in setting their legacy in their fields
since it’s easier to make landmark contributions. I view Charles Darwin, for
instance, as the greatest biologist who has ever lived, and no matter how
skilled today’s biologists are, I believe none will ever be able to surpass
Darwin’s impact. The same goes today for Alan Turing, who (possibly along with
John von Neumann) is one of the two preeminent computer scientists who have ever
lived.</p>
<p>Despite all the talent that’s out there in computer science, I don’t think any
one individual can possibly surpass Turing’s legacy on computer science and
artificial intelligence.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:reading_list">
<p>Thus, the 2017 edition of my reading list post (<a href="https://danieltakeshi.github.io/2016/12/31/allthebooksireadin2016plusmythoughtslong">here’s the
2016 version, if you’re wondering</a>) is going to be <em>very</em> biased in terms
of AI. Stay tuned! <a href="#fnref:reading_list" class="reversefootnote">↩</a></p>
</li>
<li id="fn:just_saying">
<p>I only say this because people who are members of “certain
groups” — where membership criteria is not due to choice but due to
intrinsic human characteristics — tend to have more knowledge about the
group than “outsiders.” Thus, a gay person by default has extra credibility
when writing about being gay than would a straight person. A deaf person by
default has extra credibility when writing about deafness than a hearing
person. And so on. <a href="#fnref:just_saying" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 21 May 2017 12:00:00 -0700
https://danieltakeshi.github.io/2017/05/21/alanturingtheenigma
https://danieltakeshi.github.io/2017/05/21/alanturingtheenigmaUnderstanding Deep Learning Requires Rethinking Generalization: My Thoughts and Notes<p>The paper “Understanding Deep Learning Requires Rethinking Generalization”
(<a href="https://arxiv.org/abs/1611.03530">arXiv link</a>) caused quite a stir in the Deep Learning and Machine Learning
research communities. It’s the rare paper that seems to have high research merit
— judging from being awarded one of three <em>Best Paper</em> awards at <a href="http://www.iclr.cc/doku.php?id=ICLR2017:main&redirect=1">ICLR
2017</a> — but is <em>also</em> readable. Hence, it got the most comments
of any ICLR 2017 submission on <a href="https://openreview.net/forum?id=Sy8gdB9xx&noteId=Sy8gdB9xx">OpenReview</a>. It has also been discussed on
<a href="https://www.reddit.com/r/MachineLearning/comments/5cw3lr/r_161103530_understanding_deep_learning_requires/">reddit</a> and was recently featured on <a href="https://blog.acolyer.org/2017/05/11/understandingdeeplearningrequiresrethinkinggeneralization/"><em>The Morning Paper</em></a> blog. I was
aware of the paper shortly after it was uploaded to arXiv, but never found the
time to read it in detail until now.</p>
<p>I enjoyed reading the paper, and while I agree with many readers that some of
the findings might be obvious, the paper nonetheless seems deserving of the
attention it has been getting.</p>
<p>The authors conveniently put two of their important findings in centered
italics:</p>
<blockquote>
<p>Deep neural networks easily fit random labels.</p>
</blockquote>
<p>and</p>
<blockquote>
<p>Explicit regularization may improve generalization performance, but is neither
necessary nor by itself sufficient for controlling generalization error.</p>
</blockquote>
<p>I will also quote another contribution from the paper that I find interesting:</p>
<blockquote>
<p>We complement our empirical observations with a theoretical construction
showing that generically large neural networks can express any labeling of the
training data.</p>
</blockquote>
<p>(I go through the derivation later in this post.)</p>
<p>Going back to their first claim about deep neural networks fitting random
labels, what does this mean from a <em>generalization perspective</em>?
(Generalization is just the difference between training error and testing
error.) It means that we cannot come up with a “generalization function” that
can take in a neural network as input and output a generalization quality score.
Here’s my intuition:</p>
<ul>
<li>
<p><strong>What we want</strong>: let’s imagine an arbitrary encoding of a neural network
designed to give as much deterministic information as possible, such as the
architecture and hyperparameters, and then use that encoding as input to a
generalization function. We want that function to give us a number
representing generalization quality, assuming that the datasets are allowed to
vary. The worst generalization occurs when a fixed neural network gets
excellent training error but could get either the <em>same</em> testing error
(awesome!), or get testset performance no better than <em>random guessing</em>
(ugh!).</p>
</li>
<li>
<p><strong>Reality</strong>: unfortunately, the best we can do seems to be no better than the
worst case. We know of no function that can provide bounds on generalization
performance across all datasets. Why? Let’s use the LeNet architecture and
MNIST as an example. With the right architecture, generalization error is very
small, as both training and testing accuracy are in the high 90s (percent).
With a second data set that consists of the <em>same</em> MNIST digits, but with the
<em>labels randomized</em>, that same LeNet architecture can do no better than random
guessing on the test set, even though the <em>training</em> performance is extremely
good (or at least, it should be). That’s <em>literally</em> as bad as we can get.
There’s no point in developing a function to measure generalization when we
know it can only tell us that generalization will be in between zero (i.e.
perfect) and the difference between zero and random guessing (i.e. the worst
case)!</p>
</li>
</ul>
<p>As they later discuss in the paper, regularization can be used to improve
generalization, but will not be sufficient for developing our desired
generalization criteria.</p>
<p>Let’s briefly take a step back and consider classical machine learning, which
provides us with generalization criteria such as VCdimension, Rademacher
complexity, and uniform stability. I learned about VCdimension during my
undergraduate machine learning class, Rademacher complexity during STAT 210B
this past semester, and … actually I’m not familiar with uniform stability.
But <em>intuitively</em> … it makes sense to me that classical criteria do not apply
to deep networks. To take the Rademacher complexity example: a function class
which can fit to arbitrary <script type="math/tex">\pm 1</script> noise vectors presents the trivial bound of
one, which is like saying: “generalization is between zero and the worst case.”
Not very helpful.</p>
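<p>To make that concrete, recall the standard textbook definition (this is not from the paper itself): the empirical Rademacher complexity of a hypothesis class <script type="math/tex">\mathcal{H}</script> on a fixed sample <script type="math/tex">\{x_1, \ldots, x_n\}</script> is</p>
<script type="math/tex; mode=display">\hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right]</script>
<p>where the <script type="math/tex">\sigma_i</script> are i.i.d. uniform <script type="math/tex">\pm 1</script> (Rademacher) variables. If <script type="math/tex">\mathcal{H}</script> can fit <em>any</em> sign pattern exactly, as deep networks empirically can, the supremum is <script type="math/tex">1</script> for every draw of <script type="math/tex">\sigma</script>, which is exactly the trivial bound mentioned above.</p>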
<p>The paper then proceeds to describe their testing scenario, and packs some
important results in the figure reproduced below:</p>
<p style="textalign:center;"> <img src="https://danieltakeshi.github.io/assets/understanding_dl_rethinking_gen.png" /> </p>
<p>This figure represents a neural network classifying the images in the
widely-benchmarked CIFAR-10 dataset. The network the authors used is a
simplified version of the Inception architecture.</p>
<ul>
<li>
<p>The first subplot represents five different settings of the labels and input
images. To be clear on what the “gaussian” setting means, they use a Gaussian
distribution to generate <em>random pixels</em> (!!) for every image. The mean and
variance of that Gaussian are “matched to the original dataset.” In addition,
the “shuffled” and “random” pixels apply a random permutation to the pixels,
with the <em>same</em> permutation to all images for the former, and <em>different</em>
permutations for the latter.</p>
<p>We immediately see that the neural network can get zero training error on
<em>all</em> the settings, but the convergence speed varies. Intuition suggests that
the dataset with the correct labels and the one with the same shuffling
permutation should converge quickly, and this indeed is the case.
Interestingly enough, I thought the “gaussian” setting would have the worst
performance, but that prize seems to go to “random labels.”</p>
</li>
<li>
<p>The second subplot measures training error when the amount of label noise is
varied; with some probability <script type="math/tex">p</script>, each image independently has its label
corrupted and replaced with a draw from the discrete uniform distribution over
the classes. The results show that more corruption slows convergence, which
makes sense. By the way, using a continuum of something is a common research
tactic and something I should try for my own work.</p>
</li>
<li>
<p>Finally, the third subplot measures generalization error under label
corruption. As these data points were all measured <em>after</em> convergence, this
is equivalent to the test error. The results here also make a lot of sense.
Test set error should approach 90 percent because CIFAR-10 has 10
classes (that’s why it’s called CIFAR-10!).</p>
</li>
</ul>
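<p>As a quick sanity check on that last point, here is a tiny simulation of the label-corruption process (my own sketch, not code from the paper): with full corruption (<script type="math/tex">p = 1</script>), a corrupted label agrees with the true label only about one time in ten, so even a predictor that knows the true labels is reduced to roughly 90 percent error on the corrupted set:</p>

```python
import random

def corrupt_labels(labels, p, num_classes=10, rng=None):
    """With probability p, independently replace each label with a
    uniform draw over the num_classes classes (as in the paper's setup)."""
    rng = rng or random.Random(0)
    return [rng.randrange(num_classes) if rng.random() < p else y
            for y in labels]

rng = random.Random(0)
true_labels = [rng.randrange(10) for _ in range(100000)]
noisy = corrupt_labels(true_labels, p=1.0, rng=random.Random(1))
# Fraction of fully corrupted labels that happen to match the truth: about 0.1.
agreement = sum(t == n for t, n in zip(true_labels, noisy)) / len(true_labels)
print(round(agreement, 2))
```

<p>With <script type="math/tex">p = 0</script> the labels are untouched, and intermediate values of <script type="math/tex">p</script> interpolate between the two regimes, matching the continuum in the second subplot.</p>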
<p>My major criticism of this figure is <em>not</em> that the results, particularly in the
second and third subplots, might seem obvious but that the figure <em>lacks error
bars</em>. Since it’s easy nowadays to program multiple calls in a bash script or
something similar, I would expect at least three trials and with error bars (or
“regions”) to each curve in this figure.</p>
<p>The next section discusses the role of regularization, which is normally
applied to prevent overfitting to the training data. The classic example is with
linear regression and a dataset of several points arranged in roughly a linear
fashion. Do we try to fit a straight line through these points, which might have
lots of training error, or do we take a high-dimensional polynomial and fit
<em>every</em> point exactly, even if the resulting curve looks impossibly crazy?
That’s what regularization helps to control. Explicit regularization in linear
regression is the <script type="math/tex">\lambda</script> term in the following optimization problem:</p>
<script type="math/tex; mode=display">\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2</script>
<p>I presented this <a href="https://danieltakeshi.github.io/2016/08/05/ausefulmatrixinverseequalityforridgeregression/">in an earlier blog post</a>.</p>
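<p>(As a reminder, this ridge regression objective has the standard closed-form solution</p>
<script type="math/tex; mode=display">w^* = (X^\top X + \lambda I)^{-1} X^\top y,</script>
<p>where the <script type="math/tex">\lambda I</script> term guarantees the matrix is invertible whenever <script type="math/tex">\lambda > 0</script>.)</p>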
<p>To investigate the role of regularization in Deep Learning, the authors test
with and without regularizers. Incidentally, the use of <script type="math/tex">\lambda</script> above is not
the only type of regularization. There are also several others: <strong>data
augmentation</strong>, <strong>dropout</strong>, <strong>weight decay</strong>, <strong>early stopping</strong> (implicit) and
<strong>batch normalization</strong> (implicit). These are standard tools in the modern Deep
Learning toolkit.</p>
<p>They find that, while regularization helps to improve generalization
performance, it is still possible to get excellent generalization even with <em>no</em>
regularization. They conclude:</p>
<blockquote>
<p>In summary, our observations on both explicit and implicit regularizers are
consistently suggesting that regularizers, when properly tuned, could help to
improve the generalization performance. However, it is unlikely that the
regularizers are the fundamental reason for generalization, as the networks
continue to perform well after all the regularizers [are] removed.</p>
</blockquote>
<p>On a side note, the regularization discussion in the paper feels out of order
and the writing sounds a bit off to me. I wish they had more time to fix this,
as the regularization portion of the paper contains most of my English
languagerelated criticism.</p>
<p>Moving on, the next section of the paper is about <strong>finite-sample
expressivity</strong>, or understanding what functions neural networks can express
<em>given a finite number of samples</em>. The authors state that the previous
literature focuses on <em>population analysis</em> where one can assume an arbitrary
number of samples. Here, instead, they assume a <em>fixed</em> set of <script type="math/tex">n</script> training
points <script type="math/tex">\{x_1,\ldots,x_n\}</script>. This seems easier to understand anyway.</p>
<p>They prove a theorem that relates to the third major contribution I wrote
earlier: “that generically large neural networks can express any labeling of the
training data.” Before proving the theorem, let’s begin with the following
lemma:</p>
<blockquote>
<p><strong>Lemma 1.</strong> For any two interleaving sequences of <script type="math/tex">n</script> real numbers</p>
<script type="math/tex; mode=display">% <![CDATA[
b_1 < x_1 < b_2 < x_2 < \cdots < b_n < x_n %]]></script>
<p>the <script type="math/tex">n \times n</script> matrix <script type="math/tex">A = [\max\{x_i - b_j, 0\}]_{ij}</script> has full rank.
Its smallest eigenvalue is <script type="math/tex">\min_i (x_i - b_i)</script>.</p>
</blockquote>
<p>Whenever I see statements like these, my first instinct is to draw out the
matrix. And here it is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
A &=
\begin{bmatrix}
\max\{x_1-b_1, 0\} & \max\{x_1-b_2, 0\} & \cdots & \max\{x_1-b_n, 0\} \\
\max\{x_2-b_1, 0\} & \max\{x_2-b_2, 0\} & \cdots & \max\{x_2-b_n, 0\} \\
\vdots & \ddots & \ddots & \vdots \\
\max\{x_n-b_1, 0\} & \max\{x_n-b_2, 0\} & \cdots & \max\{x_n-b_n, 0\}
\end{bmatrix} \\
&\;{\overset{(i)}{=}}\;
\begin{bmatrix}
x_1-b_1 & 0 & 0 & \cdots & 0 \\
x_2-b_1 & x_2-b_2 & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
x_{n-1}-b_1 & x_{n-1}-b_2 & \ddots & \cdots & 0 \\
x_n-b_1 & x_n-b_2 & x_n-b_3 & \cdots & x_n-b_n
\end{bmatrix}
\end{align} %]]></script>
<p>where (i) follows from the interleaving sequence assumption. This matrix is
lower-triangular, and moreover, all the nonzero elements are positive. We know
from linear algebra that lower triangular matrices</p>
<ul>
<li>are invertible if and only if the diagonal elements are nonzero</li>
<li>have their eigenvalues taken directly from the diagonal elements</li>
</ul>
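<p>These two facts are easy to verify numerically; here is my own sketch (random interleaving points and numpy are assumptions for illustration):</p>

```python
import numpy as np

# Build an interleaving sequence b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n
# by sorting 2n uniform draws and alternating them.
rng = np.random.default_rng(1)
n = 6
pts = np.sort(rng.uniform(0, 1, size=2 * n))
b, x = pts[0::2], pts[1::2]

# Lemma 1 matrix A = [max(x_i - b_j, 0)]_{ij}.
A = np.maximum(x[:, None] - b[None, :], 0.0)

print(np.allclose(A, np.tril(A)),                 # lower-triangular
      np.linalg.matrix_rank(A) == n,              # full rank
      np.allclose(np.sort(np.linalg.eigvals(A).real),
                  np.sort(x - b)))                # eigenvalues = diagonal
```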
<p>These two facts together prove Lemma 1. Next, we can prove:</p>
<blockquote>
<p><strong>Theorem 1</strong>. There exists a two-layer neural network with ReLU activations
and <script type="math/tex">2n + d</script> weights that can represent any function on a sample of size
<script type="math/tex">n</script> in <script type="math/tex">d</script> dimensions.</p>
</blockquote>
<p>Consider the function</p>
<script type="math/tex; mode=display">c(x) = \sum_{j=1}^n w_j \cdot \max\{a^Tx - b_j,0\}</script>
<p>with <script type="math/tex">w, b \in \mathbb{R}^n</script> and <script type="math/tex">a,x\in \mathbb{R}^d</script>. (There’s a typo in
the paper: <script type="math/tex">c</script> is a function from <script type="math/tex">\mathbb{R}^d\to \mathbb{R}</script>, not
<script type="math/tex">\mathbb{R}^n\to \mathbb{R}</script>.) This can certainly be represented by a depth-2
ReLU network. To be clear on <a href="http://cs231n.github.io/neuralnetworks1/">the naming convention</a>, “depth-2” does not
count the input layer, so our network should only have one ReLU layer in it, as
the output shouldn’t have ReLUs applied to it.</p>
<p>Here’s how to think of the network representing <script type="math/tex">c</script>. First, assume that we
have a <em>minibatch</em> of <script type="math/tex">n</script> elements, so that <script type="math/tex">X</script> is the <script type="math/tex">n\times d</script> data
matrix. The depth-2 network representing <script type="math/tex">c</script> can be expressed as:</p>
<script type="math/tex; mode=display">% <![CDATA[
c(X) =
\max\left(
\underbrace{\begin{bmatrix}
\texttt{--} & x_1 & \texttt{--} \\
\vdots & \vdots & \vdots \\
\texttt{--} & x_n & \texttt{--} \\
\end{bmatrix}}_{n\times d}
\underbrace{\begin{bmatrix}
\mid & & \mid \\
a & \cdots & a \\
\mid & & \mid
\end{bmatrix}}_{d \times n}
-
\underbrace{\begin{bmatrix}
b_1 & \cdots & b_n
\end{bmatrix}}_{1\times n}
, \;\;
\underbrace{\begin{bmatrix}
0 & \cdots & 0
\end{bmatrix}}_{1\times n}
\right)
\cdot
\begin{bmatrix}
w_1 \\ \vdots \\ w_n
\end{bmatrix} %]]></script>
<p>where <script type="math/tex">b</script> and the zero-vector used in the maximum “broadcast” as necessary in
Python code.</p>
<p>Given a fixed dataset <script type="math/tex">S=\{z_1,\ldots,z_n\}</script> of distinct inputs with labels
<script type="math/tex">y_1,\ldots,y_n</script>, we must be able to find settings of <script type="math/tex">a,w,</script> and <script type="math/tex">b</script> such
that <script type="math/tex">c(z_i)=y_i</script> for all <script type="math/tex">i</script>. You might be guessing how we’re doing this:
we must reduce this to the interleaving property in Lemma 1. Due to the
uniqueness of the <script type="math/tex">z_i</script>, it is possible to find <script type="math/tex">a</script> to make the
<script type="math/tex">x_i=z_i^Ta</script> terms satisfy the interleaving property. Then we have a full rank
solution, hence <script type="math/tex">y=Aw</script> results in <script type="math/tex">w^* = A^{-1}y</script> as our final weights,
where <script type="math/tex">A</script> is precisely that matrix from Lemma 1! We also see that, indeed,
there are <script type="math/tex">n+n+d</script> weights in the network. This is an interesting and fun
proof, and I think variants of this question would work well as a homework
assignment for a Deep Learning class.</p>
<p>The authors conclude the paper by trying to understand generalization with
<em>linear</em> models, in the hope that some of the intuition will transfer over to
the Deep Learning setting. With linear models, given some weights <script type="math/tex">w</script>
resulting from the optimization problem, what can we say about generalization
just by looking at it? Curvature is one popular metric to understand the
<em>quality</em> of the minima (which is not necessarily the same as the generalization
criteria!), but the Hessian is independent of <script type="math/tex">w</script>, so in fact it seems
impossible to use curvature for generalization. I’m convinced this is true for
the usual mean squared loss, but is this still true if the loss function were,
say, the <em>cube</em> of the <script type="math/tex">L_2</script> difference? After all, there are only two
derivatives applied on <script type="math/tex">w</script>, right?</p>
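<p>For the squared loss, the claim is easy to check directly: the Hessian of <script type="math/tex">\|Xw-y\|_2^2</script> is <script type="math/tex">2X^TX</script>, which contains no <script type="math/tex">w</script>. A numerical sketch of my own (random data and finite differences are assumptions for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))
y = rng.normal(size=10)

def loss(w):
    return np.sum((X @ w - y) ** 2)

def numerical_hessian(f, w, eps=1e-4):
    # Central finite-difference Hessian; exact (up to rounding) for quadratics.
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            def f2(si, sj):
                w2 = w.copy()
                w2[i] += si * eps
                w2[j] += sj * eps
                return f(w2)
            H[i, j] = (f2(1, 1) - f2(1, -1) - f2(-1, 1) + f2(-1, -1)) / (4 * eps ** 2)
    return H

# The Hessian is 2 X^T X at *every* w: curvature carries no information
# about which particular minimum the optimizer found.
H1 = numerical_hessian(loss, rng.normal(size=4))
H2 = numerical_hessian(loss, rng.normal(size=4))
print(np.allclose(H1, 2 * X.T @ X, atol=1e-3) and np.allclose(H1, H2, atol=1e-3))
```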
<p>The authors instead urge us to think of stochastic gradient descent rather than
curvature when trying to measure quality. Assuming that <script type="math/tex">w_0=0</script>, the
stochastic gradient descent update consists of a series of “linear combination”
updates, and hence the result is just a linear combination <em>of</em> linear
combinations <em>of</em> linear combinations … (and so forth) … which at the end of
the day, remains a linear combination. (I don’t think they need to assume
<script type="math/tex">w_0=0</script> if we can add an extra 1 to all the data points.) Consequently, they
can fit any set of labels of the data by solving a linear equation, and indeed,
they get strong performance on MNIST and CIFAR10, even <em>without</em>
regularization.</p>
<p>They next try to relate this to a minimum norm interpretation, though this is
not a fruitful direction because their results are worse when they try to find
minimum norm solutions. On MNIST, their best solution using some “Gabor wavelet
transform” (what?), is twice as good as the minimum norm solution. I’m not
sure how much stock to put into this section, other than how I like their
perspective of thinking of SGD as an implicit regularizer (like batch
normalization) rather than an optimizer. The line between the categories is
blurring.</p>
<p>To conclude, from my growing experience with Deep Learning, I don’t find their
experimental results surprising. That’s not to say the paper was entirely
predictable, but think of it this way: if I were a computer vision researcher
pre-AlexNet, I would be <em>more</em> surprised reading the AlexNet paper than I am
today reading this paper. Ultimately, as I mentioned earlier, I enjoyed this
paper, and while it was predictable (that word again…) that it couldn’t offer
any <em>solutions</em>, perhaps it will be useful as a starting point to understanding
generalization in Deep Learning.</p>
Fri, 19 May 2017 01:00:00 -0700
https://danieltakeshi.github.io/2017/05/19/understandingdeeplearningrequiresrethinkinggeneralizationmythoughtsandnotes
Mathematical Tricks Commonly Used in Machine Learning and Statistics<p>I have passionately studied various machine learning and statistical concepts
over the last few years. One thing I’ve learned from all this is that there are
many mathematical “tricks” involved, whether or not they are explicitly stated.
(In research papers, such tricks are often used without acknowledgment since it
is assumed that anyone who can benefit from reading the paper has the
mathematical maturity to fill in the details.) I thought it would be useful for
me, and hopefully for a few interested readers, to catalogue a set of the common
tricks here, and to see them applied in a few examples.</p>
<p>The following list, in alphabetical order, is a nonexhaustive set of tricks
that I’ve seen:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Integrating Probabilities into Expectations</li>
<li>Introducing an Independent Copy</li>
<li>Jensen’s Inequality</li>
<li>Law of Iterated Expectation</li>
<li>Lipschitz Functions</li>
<li>Markov’s Inequality</li>
<li>Norm Properties</li>
<li>Series Expansions (e.g. Taylor’s)</li>
<li>Stirling’s Approximation</li>
<li>Symmetrization</li>
<li>Take a Derivative</li>
<li>Union Bound</li>
<li>Variational Representations</li>
</ul>
<p>If the names are unclear or vague, the examples below should clarify. All the
tricks are used except for the law of iterated expectation, i.e.
<script type="math/tex">\mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X]</script>. (No particular reason for that
omission; it just turns out the exercises I’m interested in didn’t require it.)</p>
<h2 id="example1maximumofnotnecessarilyindependentsubgaussians">Example 1: Maximum of (Not Necessarily Independent!) sub-Gaussians</h2>
<p>I covered this problem in <a href="https://danieltakeshi.github.io/2017/04/22/followingprofessormichaeljordansadviceyourbrainneedsexercise">my last post here</a> so I will not repeat the
details. However, there are two extensions to that exercise which I thought would
be worth noting.</p>
<p><strong>First</strong>, to prove an upper bound for the random variable <script type="math/tex">Z =
\max_{i=1,2,\ldots,n}|X_i|</script>, it suffices to proceed as we did earlier in the
non-absolute-value case, but <em>augment</em> our sub-Gaussian variables
<script type="math/tex">X_1,\ldots,X_n</script> with the set <script type="math/tex">-X_1,\ldots,-X_n</script>. It’s OK to do this because
no independence assumptions are needed. Then it turns out that an upper bound
can be derived as</p>
<script type="math/tex; mode=display">\mathbb{E}[Z] \le 2\sqrt{\sigma^2 \log n}</script>
<p>This is the same as what we had earlier, except the “2” is now outside the
square root. It’s quite intuitive.</p>
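<p>A Monte Carlo sanity check of my own (standard Gaussians, i.e. <script type="math/tex">\sigma=1</script>; the sample sizes and seed are arbitrary assumptions):</p>

```python
import numpy as np

# Estimate E[max_i |X_i|] over n standard Gaussians and compare it to the
# bound 2 * sqrt(sigma^2 * log n); independence isn't even required by
# the bound, but IID draws are the easiest thing to simulate.
rng = np.random.default_rng(4)
n, trials = 1000, 2000
X = rng.normal(size=(trials, n))
empirical = np.abs(X).max(axis=1).mean()
bound = 2 * np.sqrt(np.log(n))
print(empirical <= bound)
```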
<p><strong>Second</strong>, consider how we can prove the following bound:</p>
<script type="math/tex; mode=display">\mathbb{P}\Big[Z \ge 2\sqrt{\sigma^2 \log n} + \delta\Big] \le 2e^{-\frac{\delta^2}{2\sigma^2}}</script>
<p>We start by applying the standard technique of multiplying by <script type="math/tex">\lambda>0</script>,
exponentiating, and then applying Markov’s Inequality with our non-negative
random variable <script type="math/tex">e^{\lambda Z}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbb{P}\left[Z \ge 2\sqrt{\sigma^2 \log n}+\delta\right] &= \mathbb{P}\left[e^{\lambda Z} \ge e^{\lambda (2\sqrt{\sigma^2 \log n} +\delta)}\right] \\
&\le \mathbb{E}[e^{\lambda Z}]e^{-\lambda 2\sqrt{\sigma^2 \log n}} e^{-\lambda \delta} \\
&{\overset{(i)}\le}\; 2n \exp\left(\frac{\lambda^2\sigma^2}{2}-\lambda\Big(\delta+ 2\sqrt{\sigma^2 \log n}\Big)\right) \\
&{\overset{(ii)}\le}\; 2n\exp\left(-\frac{1}{2\sigma^2}\Big(\delta+ 2\sqrt{\sigma^2 \log n}\Big)^2\right) \\
&= 2 \exp\left(-\frac{1}{2\sigma^2}\left[-2\sigma^2 \log n + \delta^2 + 4\delta \sqrt{\sigma^2\log n} + 4\sigma^2\log n \right]\right)
\end{align*} %]]></script>
<p>where in (i) we used a bound previously determined in our bound on
<script type="math/tex">\mathbb{E}[Z]</script> (it came out of an intermediate step), and then used the fact
that the term in the exponential is a convex quadratic to find the minimizer
value <script type="math/tex">\lambda^* = \frac{\delta+2\sqrt{\sigma^2 \log n}}{\sigma^2}</script> via
differentiation in (ii).</p>
<p>At this point, to satisfy the desired inequality, we compare terms in the
exponentials and claim that with <script type="math/tex">\delta \ge 0</script>,</p>
<script type="math/tex; mode=display">2\sigma^2 \log n + 4\delta \sqrt{\sigma^2\log n} + \delta^2 \ge \delta^2</script>
<p>This will result in our desired bound. It therefore remains to prove
this, but it reduces to checking that</p>
<script type="math/tex; mode=display">2\sigma^2 \log n + 4\delta \sqrt{\sigma^2\log n} \ge 0</script>
<p>and the left hand side is nonnegative. Hence, the desired bound holds.</p>
<p>Tricks used:</p>
<ul>
<li>Jensen’s Inequality</li>
<li>Markov’s Inequality</li>
<li>Take a Derivative</li>
<li>Union Bound</li>
</ul>
<p><strong>Comments</strong>: My earlier blog post (along with this one) shows what I mean when
I say “take a derivative.” It happens when there is an upper bound on the right
hand side and we have a free parameter <script type="math/tex">\lambda \in \mathbb{R}</script> (or <script type="math/tex">\lambda
\ge 0</script>) which we can optimize to get the <em>tightest</em> possible bound. Oftentimes,
such a <script type="math/tex">\lambda</script> is <em>explicitly introduced</em> via Markov’s Inequality, as we
have here. Just make sure to double check that when taking a derivative, you’re
getting a <em>minimum</em>, not a maximum. In addition, Markov’s Inequality can only be
applied to <em>non-negative</em> random variables, which is why we often have to
exponentiate the terms inside a probability statement first.</p>
<p>Note the use of convexity of the exponential function. It is <em>very common</em> to
see Jensen’s inequality applied with the exponential function. Always remember
that <script type="math/tex">e^{\mathbb{E}[X]} \le \mathbb{E}[e^X]</script>!!</p>
<p>The procedure that I refer to as the “union bound” when I bound a maximum by a
sum isn’t exactly the canonical way of doing it, since that typically involves
probabilities, but it has a similar flavor. More formally, the <a href="https://en.wikipedia.org/wiki/Boole%27s_inequality">union bound</a>
states that</p>
<script type="math/tex; mode=display">\mathbb{P}\left[\cup_{i=1}^n A_i\right] \le \sum_{i=1}^n \mathbb{P}\left[A_i\right]</script>
<p>for countable sets of events <script type="math/tex">A_1,A_2,\ldots</script>. When we define a set of events
based on a maximum of certain variables, that’s the same as taking the union of
the individual events.</p>
<p>On a final note, be on the lookout for applications of this type whenever a
“maximum” operation is seen with something that resembles Gaussians. Sometimes
this can be a bit subtle. For instance, it’s not uncommon to use a bound of the
form above when dealing with <script type="math/tex">\mathbb{E}[\|w\|_\infty]</script>, the expectation of
the <script type="math/tex">L_\infty</script>-norm of a standard Gaussian vector. In addition, when dealing
with sparsity, often our “<script type="math/tex">n</script>” or “<script type="math/tex">d</script>” is actually something like <script type="math/tex">{d
\choose s}</script> or another combinatorics-style value. Seeing a “log” accompanied by
a square root is a good clue and may help identify such cases.</p>
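<p>On that combinatorics point, the workhorse inequality is <script type="math/tex">{d \choose s} \le \left(\frac{ed}{s}\right)^s</script>, which is what turns a <script type="math/tex">\log {d \choose s}</script> into an <script type="math/tex">s \log\left(\frac{ed}{s}\right)</script>. A brute-force check of my own (the ranges of <script type="math/tex">d</script> are arbitrary):</p>

```python
import math

# Verify log C(d, s) <= s * log(e * d / s) for every s in 1..d,
# the bound that drives sqrt(s * log(ed/s))-style rates.
ok = all(
    math.log(math.comb(d, s)) <= s * math.log(math.e * d / s)
    for d in (10, 100, 1000)
    for s in range(1, d + 1)
)
print(ok)
```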
<h2 id="example2boundedrandomvariablesaresubgaussian">Example 2: Bounded Random Variables are Sub-Gaussian</h2>
<p>This example is really split into two parts. The first is as follows:</p>
<blockquote>
<p>Prove that Rademacher random variables are sub-Gaussian with parameter
<script type="math/tex">\sigma = 1</script>.</p>
</blockquote>
<p>The next is:</p>
<blockquote>
<p>Prove that if <script type="math/tex">X</script> is zero-mean and has support <script type="math/tex">X \in [a,b]</script>, then <script type="math/tex">X</script>
is sub-Gaussian with parameter (at most) <script type="math/tex">\sigma = b-a</script>.</p>
</blockquote>
<p>To prove the first part, let <script type="math/tex">\varepsilon</script> be a Rademacher random variable.
For <script type="math/tex">\lambda \in \mathbb{R}</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}[e^{\lambda \varepsilon}] \;&{\overset{(i)}{=}}\; \frac{1}{2}\left(e^{-\lambda} + e^{\lambda}\right) \\
\;&{\overset{(ii)}{=}}\; \frac{1}{2}\left( \sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} + \sum_{k=0}^\infty \frac{\lambda^k}{k!}\right) \\
\;&{\overset{(iii)}{=}}\; \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \\
\;&{\overset{(iv)}{\le}}\; \sum_{k=0}^\infty \frac{\lambda^{2k}}{2^kk!} \\
\;&{\overset{(v)}{=}}\; e^{\frac{\lambda^2}{2}},
\end{align} %]]></script>
<p>and thus the claim is satisfied by the definition of a sub-Gaussian random
variable. In (i), we removed the expectation by using facts from
Rademacher random variables, in (ii) we used the series expansions of the
exponential function, in (iii) we simplified by removing the odd powers, in (iv)
we used the clever trick that <script type="math/tex">2^kk! \le (2k)!</script>, and in (v) we <em>again</em> used
the exponential function’s power series.</p>
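<p>The end result of the chain above is just <script type="math/tex">\cosh(\lambda) \le e^{\lambda^2/2}</script>, which can be eyeballed numerically; a one-liner check of my own (the grid of <script type="math/tex">\lambda</script> values is arbitrary):</p>

```python
import numpy as np

# E[exp(lambda * eps)] for Rademacher eps is cosh(lambda); the derivation
# above shows cosh(lambda) <= exp(lambda^2 / 2) for every real lambda.
lams = np.linspace(-10, 10, 2001)
print(bool(np.all(np.cosh(lams) <= np.exp(lams ** 2 / 2))))
```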
<p>To prove the next part, observe that for any <script type="math/tex">\lambda \in \mathbb{R}</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{E}_{X}[e^{\lambda X}] \;&{\overset{(i)}{=}}\; \mathbb{E}_{X}\Big[e^{\lambda (X - \mathbb{E}_{X'}[X'])}\Big] \\
\;&{\overset{(ii)}{\le}}\; \mathbb{E}_{X,X'}\Big[e^{\lambda (X - X')}\Big] \\
\;&{\overset{(iii)}{=}}\; \mathbb{E}_{X,X',\varepsilon}\Big[e^{\lambda \varepsilon(X - X')}\Big] \\
\;&{\overset{(iv)}{\le}}\; \mathbb{E}_{X,X'}\Big[e^{\frac{\lambda^2 (X - X')^2}{2}}\Big] \\
\;&{\overset{(v)}{\le}}\; e^{\frac{\lambda^2(b-a)^2}{2}},
\end{align} %]]></script>
<p>which shows by definition that <script type="math/tex">X</script> is sub-Gaussian with parameter <script type="math/tex">\sigma =
b-a</script>. In (i), we cleverly introduce <em>an extra independent copy</em> <script type="math/tex">X'</script> inside
the exponent. It’s zero-mean, so we can insert it there without issues.<sup id="fnref:miller"><a href="#fn:miller" class="footnote">1</a></sup>
In (ii), we use Jensen’s inequality, and note that we can do this with respect
to just the random variable <script type="math/tex">X'</script>. (If this is confusing, just think of the
expression as a function of <script type="math/tex">X'</script> and ignore the outer expectation.) In (iii)
we apply a clever <em>symmetrization</em> trick by multiplying a Rademacher random
variable to <script type="math/tex">X-X'</script>. The reason why we can do this is that <script type="math/tex">X-X'</script> is already
symmetric about zero. Hence, inserting the Rademacher factor will maintain that
symmetry (since Rademachers are only +1 or -1). In (iv), we applied the
Rademacher sub-Gaussian bound with <script type="math/tex">X-X'</script> held fixed, and then in (v), we
finally use the fact that <script type="math/tex">X,X' \in [a,b]</script>.</p>
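<p>Here is a small Monte Carlo check of the conclusion, a sketch of my own (the uniform distribution, sample size, and grid of <script type="math/tex">\lambda</script> values are illustrative assumptions):</p>

```python
import numpy as np

# A zero-mean variable on [a, b] should satisfy
# E[exp(lambda * X)] <= exp(lambda^2 * (b - a)^2 / 2).
# Take X ~ Uniform(-0.5, 0.5), so a = -0.5, b = 0.5, b - a = 1.
rng = np.random.default_rng(6)
a_, b_ = -0.5, 0.5
X = rng.uniform(a_, b_, size=200_000)

ok = all(
    np.mean(np.exp(lam * X)) <= np.exp(lam ** 2 * (b_ - a_) ** 2 / 2)
    for lam in (-3.0, -1.0, 0.5, 2.0)
)
print(ok)
```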
<p>Tricks used:</p>
<ul>
<li>Introducing an Independent Copy</li>
<li>Jensen’s Inequality</li>
<li>Series Expansions (twice!!)</li>
<li>Symmetrization</li>
</ul>
<p><strong>Comments</strong>: The first part is a classic exercise in theoretical statistics,
one which tests your ability to understand how to use the power series of
exponential functions. The first part involved converting an exponential
function to a power series, and then later doing <em>the reverse</em>. When I was doing
this problem, I found it easiest to start by stating the conclusion — that we
would have <script type="math/tex">e^{\frac{\lambda^2}{2}}</script> somehow — and then I worked backwards.
Obviously, this only works when the problem gives us the solution!</p>
<p>The next part is also “classic” in the sense that it’s often how students (such
as myself) are introduced to the symmetrization trick. The takeaway is that one
should be on the lookout for anything that seems symmetric. Or, failing that,
perhaps <em>introduce</em> symmetry by adding in an extra independent copy, as we did
above. But make sure that your random variables are zero-mean!!</p>
<h2 id="example3concentrationaroundmedianandmeans">Example 3: Concentration Around Median and Means</h2>
<p>Here’s the question:</p>
<blockquote>
<p>Given a scalar random variable <script type="math/tex">X</script>, suppose that there are positive
constants <script type="math/tex">c_1,c_2</script> such that</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-\mathbb{E}[X]| \ge t] \le c_1e^{-c_2t^2}</script>
<p>for all <script type="math/tex">t \ge 0</script>.</p>
<p>(a) Prove that <script type="math/tex">{\rm Var}(X) \le \frac{c_1}{c_2}</script></p>
<p>(b) Prove that for any median <script type="math/tex">m_X</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le c_3e^{-c_4t^2}</script>
<p>for all <script type="math/tex">t \ge 0</script>, where <script type="math/tex">c_3 = 4c_1</script> and <script type="math/tex">c_4 = \frac{c_2}{8}</script>.</p>
</blockquote>
<p>To prove the first part, note that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
{\rm Var}(X) \;&{\overset{(i)}{=}}\; \mathbb{E}\Big[|X-\mathbb{E}[X]|^2 \Big] \\
\;&{\overset{(ii)}{=}}\; 2 \int_{t=0}^\infty t \cdot \mathbb{P}[|X-\mathbb{E}[X]| \ge t]dt \\
\;&{\overset{(iii)}{\le}}\; \frac{c_2}{c_2} \int_{t=0}^\infty 2t c_1e^{-c_2t^2} dt \\
\;&{\overset{(iv)}{=}}\; \frac{c_1}{c_2},
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from the “integrating
probabilities into expectations” trick (which I will describe shortly), (iii)
follows from the provided bound, and (iv) follows from standard calculus (note
the multiplication of <script type="math/tex">c_2/c_2</script> for mathematical convenience). This proves the
first claim.</p>
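<p>A concrete instance of my own for part (a): a standard Gaussian satisfies <script type="math/tex">\mathbb{P}[|X-\mathbb{E}[X]| \ge t] \le 2e^{-t^2/2}</script>, i.e. <script type="math/tex">c_1=2</script> and <script type="math/tex">c_2=\frac{1}{2}</script>, so the claim promises <script type="math/tex">{\rm Var}(X) \le 4</script> (loose, since the truth is 1). A quick simulation (sample size and seed are arbitrary):</p>

```python
import numpy as np

# Standard Gaussian: c_1 = 2, c_2 = 1/2 in the sub-Gaussian tail bound,
# so part (a) gives Var(X) <= c_1 / c_2 = 4; the true variance is 1.
rng = np.random.default_rng(7)
X = rng.normal(size=100_000)
c1, c2 = 2.0, 0.5
print(np.var(X) <= c1 / c2)
```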
<p>The second part requires some clever insights to get it to work. One way to
start is by noting that:</p>
<script type="math/tex; mode=display">\frac{1}{2} = \mathbb{P}[X \ge m_X] = \mathbb{P}\Big[X-\mathbb{E}[X] \ge
m_X-\mathbb{E}[X]\Big] \le c_1e^{-c_2(m_X-\mathbb{E}[X])^2}</script>
<p>and where the last inequality follows from the bound provided in the question.
For us to be able to apply that bound, assume without loss of generality that
<script type="math/tex">m_X \ge \mathbb{E}[X]</script>, meaning that our <script type="math/tex">t = m_X-\mathbb{E}[X]</script> term is
positive and that we can increase the probability by inserting absolute
values. The above also shows that</p>
<script type="math/tex; mode=display">m_X-\mathbb{E}[X] \le \sqrt{\frac{\log(2c_1)}{c_2}}</script>
<p>We next tackle the core of the question. Starting from the left hand side of the
desired bound, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}[|X-m_X| \ge t] \;&{\overset{(i)}{=}}\; \mathbb{P}\Big[|X - \mathbb{E}[X] + \mathbb{E}[X]-m_X| \ge t\Big] \\
\;&{\overset{(ii)}{\le}}\; \mathbb{P}\Big[|X - \mathbb{E}[X]| \ge t - |\mathbb{E}[X] - m_X|\Big] \\
\;&{\overset{(iii)}{\le}}\; c_1e^{-c_2(t - |\mathbb{E}[X] - m_X|)^2}
\end{align} %]]></script>
<p>where step (i) follows from adding zero, step (ii) follows from the Triangle
Inequality, and (iii) follows from the provided bound based on the expectation.
And yes, this is supposed to work only for when <script type="math/tex">t-|\mathbb{E}[X]-m_X| > 0</script>. The
way to get around this is that we need to assume <script type="math/tex">t</script> is greater than some
quantity. After some algebra, it turns out a nice condition for us to enforce is
that <script type="math/tex">t > \sqrt{\frac{8\log(4c_1)}{c_2}}</script>, which in turn will make
<script type="math/tex">t-|\mathbb{E}[X]-m_X| > 0</script>. If <script type="math/tex">% <![CDATA[
t < \sqrt{\frac{8\log(4c_1)}{c_2}} %]]></script>, then the
desired bound is attained because</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le 1 \le 4c_1 e^{-\frac{c_2}{8}t^2}</script>
<p>a fact which can be derived through some algebra. Thus, the remainder of the
proof boils down to checking the case that when <script type="math/tex">t >
\sqrt{\frac{8\log(4c_1)}{c_2}}</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{P}[|X-m_X| \ge t] \le c_1e^{-c_2(t - |\mathbb{E}[X] - m_X|)^2} \le 4c_1 e^{-\frac{c_2}{8}t^2}</script>
<p>and this is proved by analyzing roots of the quadratic and solving for <script type="math/tex">t</script>.</p>
<p>Tricks used:</p>
<ul>
<li>Integrating Probabilities into Expectations</li>
<li>Triangle Inequality</li>
</ul>
<p><strong>Comments</strong>: The trick “integrating probabilities into expectations” is one
which I only recently learned about, though one can easily find it (along with
the derivation) on the <a href="https://en.wikipedia.org/wiki/Expected_value">Wikipedia page for the expected values</a>. In
particular, note that for a positive real number <script type="math/tex">\alpha</script>, we have</p>
<script type="math/tex; mode=display">\mathbb{E}[X^\alpha] = \alpha \int_{0}^\infty t^{\alpha-1}\mathbb{P}[X \ge t]dt</script>
<p>and in the above, I use this trick with <script type="math/tex">\alpha=2</script>. It’s quite useful to
convert between probabilities and expectations!</p>
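<p>The identity is also easy to verify numerically on a distribution with a known closed form; here is my own sketch using <script type="math/tex">X \sim {\rm Exp}(1)</script> and <script type="math/tex">\alpha=2</script>, where both sides equal <script type="math/tex">\mathbb{E}[X^2]=2</script> (the grid and truncation at 50 are numerical assumptions):</p>

```python
import numpy as np

# Check E[X^alpha] = alpha * int_0^inf t^{alpha-1} P[X >= t] dt
# for X ~ Exponential(1), where P[X >= t] = exp(-t) and E[X^2] = 2.
alpha = 2
t = np.linspace(0.0, 50.0, 500_001)
f = t ** (alpha - 1) * np.exp(-t)
integral = alpha * np.sum((f[1:] + f[:-1]) / 2 * np.diff(t))  # trapezoid rule
print(round(integral, 3))  # → 2.0
```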
<p>The other trick above is using the triangle inequality in a clever way. The key
is to observe that when we have something like <script type="math/tex">\mathbb{P}[X\ge Y]</script>, if we
<em>increase</em> the value of <script type="math/tex">X</script>, then we increase that probability. This is
another common trick used in proving various bounds.</p>
<p>Finally, the above also shows that when we have constants <script type="math/tex">t</script>, it pays to be
clever in how we assign those values. Then the remainder is some bruteforce
computation. I suppose it also helps to think about inserting <script type="math/tex">1/2</script>s whenever
we have a probability and a median.</p>
<h2 id="example4upperboundsforell_0balls">Example 4: Upper Bounds for <script type="math/tex">\ell_0</script> “Balls”</h2>
<p>Consider the set</p>
<script type="math/tex; mode=display">T^d(s) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \|\theta\|_2 \le 1\}</script>
<p>We often write the number of nonzeros in <script type="math/tex">\theta</script> as <script type="math/tex">\|\theta\|_0</script> like
this even though <script type="math/tex">\|\cdot\|_0</script> is not technically a norm. This exercise
consists of three parts:</p>
<blockquote>
<p>(a) Show that <script type="math/tex">\mathcal{G}(T^d(s)) = \mathbb{E}[\max_{\mathcal{S}}
\|w_S\|_2]</script> where <script type="math/tex">\mathcal{S}</script> consists of all subsets <script type="math/tex">S</script> of
<script type="math/tex">\{1,2,\ldots, d\}</script> of size <script type="math/tex">s</script>, and <script type="math/tex">w_S</script> is a subvector of <script type="math/tex">w</script> (of
size <script type="math/tex">s</script>) indexed by those components. Note that by this definition, the
cardinality of <script type="math/tex">\mathcal{S}</script> is equal to <script type="math/tex">{d \choose s}</script>.</p>
<p>(b) Show that for any fixed subset <script type="math/tex">S</script> of cardinality <script type="math/tex">s</script>, we have
<script type="math/tex">\mathbb{P}[\|w_S\|_2 \ge \sqrt{s} + \delta] \le e^{-\frac{\delta^2}{2}}</script>.</p>
<p>(c) Establish the claim that <script type="math/tex">\mathcal{G}(T^d(s)) \precsim \sqrt{s \log
\left(\frac{ed}{s}\right)}</script>.</p>
</blockquote>
<p>To be clear on the notation, <script type="math/tex">\mathcal{G}(T^d(s)) =
\mathbb{E}\left[\sup_{\theta \in T^d(s)} \langle \theta, w \rangle\right]</script> and
refers to the <em>Gaussian complexity</em> of that set. It is, roughly speaking, a way
to measure the “size” of a set.</p>
<p>To prove (a), let <script type="math/tex">\theta \in T^d(s)</script> and let <script type="math/tex">S</script> indicate the support of
<script type="math/tex">\theta</script> (i.e. where its nonzeros occur). For any <script type="math/tex">w \in \mathbb{R}^d</script>
(which we later treat to be sampled from <script type="math/tex">N(0,I_d)</script>, though the immediate
analysis below does not require that fact) we have</p>
<script type="math/tex; mode=display">\langle \theta, w \rangle =
\langle \tilde{\theta}, w_S \rangle \le
\|\tilde{\theta}\|_2 \|w_S\|_2 \le
\|w_S\|_2,</script>
<p>where <script type="math/tex">\tilde{\theta}\in \mathbb{R}^s</script> refers to the vector taking only the
nonzero components from <script type="math/tex">\theta</script>. The first inequality follows from
Cauchy-Schwarz. In addition, by standard norm properties, taking <script type="math/tex">\theta =
\frac{w_S}{\|w_S\|_2} \in T^d(s)</script> results in the case when equality is
attained. The claim thus follows. (There are some technical details needed
regarding which of the maximums — over the set sizes or over the vector
selection — should come first, but I don’t think the details are critical for
me to know.)</p>
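<p>Part (a) also suggests an easy Monte Carlo estimate, since the maximizing support <script type="math/tex">S</script> is just the <script type="math/tex">s</script> largest entries of <script type="math/tex">w</script> in absolute value. A sketch of my own (dimensions, seed, and the slack constant 2 in the comparison are illustrative assumptions):</p>

```python
import numpy as np

# Estimate G(T^d(s)) = E[max_S ||w_S||_2] for w ~ N(0, I_d): the best
# size-s support picks the s largest squared coordinates. Compare against
# the (up to constants) rate sqrt(s * log(e * d / s)) from part (c).
rng = np.random.default_rng(8)
d, s, trials = 200, 5, 2000
W = rng.normal(size=(trials, d))
top_s = np.sort(W ** 2, axis=1)[:, -s:]        # s largest squared entries
complexity = np.sqrt(top_s.sum(axis=1)).mean() # Monte Carlo Gaussian complexity
print(complexity <= 2 * np.sqrt(s * np.log(np.e * d / s)))
```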
<p>For (b), we first claim that the function <script type="math/tex">f_S : \mathbb{R}^d \to \mathbb{R}</script>
defined as <script type="math/tex">f_S(w) := \|w_S\|_2</script> is Lipschitz with respect to the Euclidean
norm with Lipschitz constant <script type="math/tex">L=1</script>. To see this, observe that when <script type="math/tex">w</script> and
<script type="math/tex">w'</script> are both <script type="math/tex">d</script>dimensional vectors, we have</p>
<script type="math/tex; mode=display">|f_S(w)-f_S(w')| =
\Big|\|w_S\|_2-\|w_S'\|_2\Big| \;{\overset{(i)}{\le}}\;
\|w_S-w_S'\|_2 \;{\overset{(ii)}{\le}}\;
\|w-w'\|_2,</script>
<p>where (i) follows from the reverse triangle inequality for normed spaces and
(ii) follows from how the vector <script type="math/tex">w_S-w_S'</script> cannot have more nonzero terms
than <script type="math/tex">w-w'</script> but must otherwise match it for indices lying in the subset <script type="math/tex">S</script>.</p>
<p>The fact that <script type="math/tex">f_S</script> is Lipschitz means that we can apply a theorem regarding
tail bounds of Lipschitz functions of Gaussian variables. The function <script type="math/tex">f_S</script>
here doesn’t <em>require</em> its input to consist of vectors with IID standard
Gaussian components, but we have to assume that the input is like that for the
purposes of the theorem/bound to follow. More formally, for all <script type="math/tex">\delta \ge 0</script>
we have</p>
<script type="math/tex; mode=display">\mathbb{P}\Big[\|w_S\|_2 \ge \sqrt{s} + \delta\Big] \;{\overset{(i)}{\le}}\;
\mathbb{P}\Big[\|w_S\|_2 \ge \mathbb{E}[\|w_S\|_2] + \delta \Big]\;{\overset{(ii)}{\le}}\;
e^{-\frac{\delta^2}{2}}</script>
<p>where (i) follows from how <script type="math/tex">\mathbb{E}[\|w_S\|_2] \le \sqrt{s}</script> and thus we
are just decreasing the threshold for the event (hence making it more likely)
and (ii) follows from the theorem, which provides an <script type="math/tex">L</script> in the denominator of
the exponential, but <script type="math/tex">L=1</script> here.</p>
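<p>As a quick numerical sanity check (my own illustration, not part of the original exercise, with arbitrary choices of dimension, sparsity, and threshold), a Monte Carlo simulation can confirm that the empirical tail probability sits below the exp(-delta^2/2) bound:</p>

```python
import numpy as np

def tail_vs_bound(d=100, s=10, delta=1.0, trials=20000, seed=0):
    """Estimate P[||w_S||_2 >= sqrt(s) + delta] for w with IID N(0,1)
    entries, and compare against the concentration bound exp(-delta^2/2)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((trials, d))
    # WLOG take S to be the first s coordinates; any fixed S works.
    norms = np.linalg.norm(w[:, :s], axis=1)
    empirical = np.mean(norms >= np.sqrt(s) + delta)
    bound = np.exp(-delta**2 / 2)
    return empirical, bound
```

With, say, d = 100, s = 10, and delta = 1, the empirical frequency should land comfortably below exp(-1/2), which is roughly 0.61.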
<p>Finally, to prove (c), we first note that the previous part’s theorem guaranteed
that the function <script type="math/tex">f_S(w) = \|w_S\|_2</script> is sub-Gaussian with parameter
<script type="math/tex">\sigma=L=1</script>. Using this, we have</p>
<script type="math/tex; mode=display">\mathcal{G}(T^d(s)) = \mathbb{E}\Big[\max_{S \in \mathcal{S}} \|w_S\|_2\Big]
\;{\overset{(i)}{\le}}\; \sqrt{2 \sigma^2 \log {d \choose s}}
\;{\overset{(ii)}{\precsim}}\; \sqrt{s \log \left(\frac{ed}{s}\right)}</script>
<p>where (i) applies the bound for a maximum over sub-Gaussian random variables
<script type="math/tex">\|w_S\|_2</script> for all the <script type="math/tex">{d\choose s}</script> sets <script type="math/tex">S \in \mathcal{S}</script> (see
Example 1 earlier), each with parameter <script type="math/tex">\sigma</script>, and (ii) applies an
approximate bound due to Stirling’s approximation and ignores the constants of
<script type="math/tex">\sqrt{2}</script> and <script type="math/tex">\sigma</script>. The careful reader will note that Example 1
required <em>zero</em>-mean sub-Gaussian random variables, but we can generally get
around this by, I believe, subtracting away the mean and then re-adding it later.</p>
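<p>To make part (c) concrete, here is a small Monte Carlo sketch (my own illustration, with arbitrary d and s): the maximum of the norm over all size-s subsets is just the Euclidean norm of the s largest-magnitude coordinates of w, so the Gaussian complexity can be estimated directly. Since the norms are not zero-mean, I compare against the mean-adjusted bound sqrt(s) + sqrt(2 log C(d, s)):</p>

```python
import math
import numpy as np

def sparse_width_vs_bound(d=50, s=5, trials=5000, seed=0):
    """Estimate G(T^d(s)) = E[max over |S|=s of ||w_S||_2] and compare it
    with the mean-adjusted maximum bound sqrt(s) + sqrt(2*log(d choose s))."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((trials, d))
    # max over subsets S of ||w_S||_2 = norm of the s largest |w_i|
    top = np.sort(w**2, axis=1)[:, -s:]
    estimate = np.mean(np.sqrt(top.sum(axis=1)))
    bound = math.sqrt(s) + math.sqrt(2 * math.log(math.comb(d, s)))
    return estimate, bound
```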
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Jensen’s Inequality</li>
<li>Lipschitz Functions</li>
<li>Norm Properties</li>
<li>Stirling’s Approximation</li>
<li>Triangle Inequality</li>
</ul>
<p><strong>Comments</strong>: This exercise involves a number of tricks. The fact that
<script type="math/tex">\mathbb{E}[\|w_S\|_2] \le \sqrt{s}</script> follows from how</p>
<script type="math/tex; mode=display">\mathbb{E}[\|w_S\|_2] = \mathbb{E}\Big[\sqrt{\|w_S\|_2^2}\Big] \le
\sqrt{\mathbb{E}[\|w_S\|_2^2]} = \sqrt{s}</script>
<p>due to Jensen’s inequality and how <script type="math/tex">\mathbb{E}[X^2]=1</script> for <script type="math/tex">X \sim
N(0,1)</script>. Fiddling with norms, expectations, and square roots is another common
way to utilize Jensen’s inequality (in addition to using Jensen’s inequality
with the exponential function, as explained earlier). Moreover, if you see norms
in a probabilistic bound statement, you should immediately be thinking of the
possibility of using a theorem related to Lipschitz functions.</p>
<p>The example also uses the (reverse!) triangle inequality for norms:</p>
<script type="math/tex; mode=display">\Big| \|x\|_2-\|y\|_2\Big| \le \|x-y\|_2</script>
<p>This can come up quite often and is the non-canonical way of viewing the
triangle inequality, so watch out!</p>
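<p>A quick numeric spot-check of this reverse triangle inequality (my own illustration on random vector pairs):</p>

```python
import numpy as np

def reverse_triangle_holds(trials=1000, dim=20, seed=0):
    """Check | ||x||_2 - ||y||_2 | <= ||x - y||_2 on random vector pairs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        if abs(np.linalg.norm(x) - np.linalg.norm(y)) > np.linalg.norm(x - y) + 1e-12:
            return False
    return True
```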
<p>Finally, don’t forget the trick where we have <script type="math/tex">{d \choose s} \le
\left(\frac{ed}{s}\right)^s</script>. This comes from <a href="https://math.stackexchange.com/questions/132625/nchoosekleqleftfracenkrightk">an application of Stirling’s
approximation</a> and is seen frequently in cases involving <em>sparsity</em>, where
<script type="math/tex">s</script> components are “selected” out of <script type="math/tex">d \gg s</script> total. The maximum over a
finite set should also provide a big hint regarding the use of a sub-Gaussian
bound over maximums of (sub-Gaussian) variables.</p>
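<p>The binomial-coefficient bound is also easy to verify numerically over a small sweep of (d, s) pairs; this snippet is my own illustration, not part of the original post:</p>

```python
import math

def stirling_bound_holds(d_max=60):
    """Verify C(d, s) <= (e*d/s)**s for all 1 <= s <= d <= d_max."""
    return all(
        math.comb(d, s) <= (math.e * d / s) ** s
        for d in range(1, d_max + 1)
        for s in range(1, d + 1)
    )
```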
<h2 id="example5gaussiancomplexityofellipsoids">Example 5: Gaussian Complexity of Ellipsoids</h2>
<blockquote>
<p>Recall that the space <script type="math/tex">\ell_2(\mathbb{N})</script> consists of all real sequences
<script type="math/tex">\{\theta_j\}_{j=1}^\infty</script> such that <script type="math/tex">% <![CDATA[
\sum_{j=1}^\infty \theta_j^2 < \infty %]]></script>. Given a strictly positive sequence <script type="math/tex">\{\mu_j\}_{j=1}^\infty \in \ell_2(\mathbb{N})</script>,
consider the associated ellipse</p>
<script type="math/tex; mode=display">\mathcal{E} := \left\{\{\theta_j\}_{j=1}^\infty \in \ell_2(\mathbb{N}) \;\Big|\;
\sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \le 1\right\}</script>
<p>(a) Prove that the Gaussian complexity satisfies the bounds</p>
<script type="math/tex; mode=display">\sqrt{\frac{2}{\pi}}\left(\sum_{j=1}^\infty \mu_j^2 \right)^{1/2} \le
\mathcal{G}(\mathcal{E}) \le \left(\sum_{j=1}^\infty \mu_j^2 \right)^{1/2}</script>
<p>(b) For a given radius <script type="math/tex">r > 0</script>, consider the truncated set</p>
<script type="math/tex; mode=display">\tilde{\mathcal{E}} := \mathcal{E} \cap \left\{\{\theta_j\}_{j=1}^\infty
\;\Big|\; \sum_{j=1}^\infty \theta_j^2 \le r^2 \right\}</script>
<p>Obtain upper and lower bounds on its Gaussian complexity that are tight up to
universal constants independent of <script type="math/tex">r</script> and <script type="math/tex">\{\mu_j\}_{j=1}^\infty</script>.</p>
</blockquote>
<p>To prove (a), we first start with the upper bound. Letting <script type="math/tex">w</script> indicate a
sequence of IID standard Gaussians <script type="math/tex">w_i</script>, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{G}(\mathcal{E}) \;&{\overset{(i)}{=}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty w_i\theta_i \right] \\
\;&{\overset{(ii)}{=}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \frac{\theta_i}{\mu_i}w_i\mu_i \right] \\
\;&{\overset{(iii)}{\le}}\; \mathbb{E}_w\left[ \sup_{\theta \in \mathcal{E}} \left(\sum_{i=1}^\infty\frac{\theta_i^2}{\mu_i^2}\right)^{1/2}\left(\sum_{i=1}^\infty w_i^2 \mu_i^2\right)^{1/2} \right] \\
\;&{\overset{(iv)}{\le}}\; \mathbb{E}_w\left[ \left(\sum_{i=1}^\infty w_i^2 \mu_i^2 \right)^{1/2} \right] \\
\;&{\overset{(v)}{\le}}\; \sqrt{\mathbb{E}_w\left[ \sum_{i=1}^\infty w_i^2 \mu_i^2 \right]} \\
\;&{\overset{(vi)}{=}}\; \left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from multiplying by one, (iii)
follows from a clever application of the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">CauchySchwarz inequality</a> for
sequences (or more generally, <a href="https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality">Holder’s Inequality</a>), (iv) follows from the
definition of <script type="math/tex">\mathcal{E}</script>, (v) follows from Jensen’s inequality, and (vi)
follows from linearity of expectation and how <script type="math/tex">\mathbb{E}_{w_i}[w_i^2]=1</script>.</p>
<p>We next prove the lower bound. First, we note a well-known result that
<script type="math/tex">\sqrt{\frac{2}{\pi}}\mathcal{R}(\mathcal{E}) \le \mathcal{G}(\mathcal{E})</script>
where <script type="math/tex">\mathcal{R}(\mathcal{E})</script> indicates the <em>Rademacher</em> complexity of the
set. Thus, our task now boils down to showing that <script type="math/tex">\mathcal{R}(\mathcal{E}) =
\left(\sum_{i=1}^\infty \mu_i^2 \right)^{1/2}</script>. Letting <script type="math/tex">\varepsilon_i</script> be
IID Rademachers, we first begin by proving the <em>upper</em> bound</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{R}(\mathcal{E}) \;&{\overset{(i)}{=}}\; \mathbb{E}_\varepsilon\left[ \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \varepsilon_i\theta_i \right] \\
\;&{\overset{(ii)}{=}}\; \sup_{\theta \in \mathcal{E}}\sum_{i=1}^\infty \Big|\frac{\theta_i}{\mu_i}\Big|\mu_i \\
\;&{\overset{(iii)}{\le}}\; \sup_{\theta \in \mathcal{E}} \left(\sum_{i=1}^\infty\frac{\theta_i^2}{\mu_i^2}\right)^{1/2}\left(\sum_{i=1}^\infty \mu_i^2\right)^{1/2} \\
\;&{\overset{(iv)}{=}}\; \left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}
\end{align} %]]></script>
<p>where (i) follows from definition, (ii) follows from the symmetric nature of the
class of <script type="math/tex">\theta</script> (meaning that WLOG we can pick <script type="math/tex">\varepsilon_i = 1</script> for all
<script type="math/tex">i</script>) and then multiplying by one, (iii) follows from Cauchy-Schwarz again,
and (iv) follows from the provided bound in the definition of <script type="math/tex">\mathcal{E}</script>.</p>
<p>We’re not done yet: we actually need to show <em>equality</em> for this, or at the very
least prove a <em>lower</em> bound instead of an upper bound. However, if one chooses
the valid sequence <script type="math/tex">\{\theta_j\}_{j=1}^\infty</script> such that <script type="math/tex">\theta_j =
\mu_j^2 / (\sum_{k=1}^\infty \mu_k^2)^{1/2}</script>, then equality is attained since we
get</p>
<script type="math/tex; mode=display">\frac{\sum_{i=1}^\infty \mu_i^2}{\left(\sum_{i=1}^\infty \mu_i^2\right)^{1/2}} =
\left( \sum_{i=1}^\infty \mu_i^2 \right)^{1/2}</script>
<p>in one of our steps above. This proves part (a).</p>
<p>For part (b), we construct two ellipses, one that contains
<script type="math/tex">\tilde{\mathcal{E}}</script> and one which is contained inside it. Let <script type="math/tex">m_i :=
\min\{\mu_i, r\}</script>. Then we claim that the ellipse <script type="math/tex">\mathcal{E}_{m}</script> defined
out of this sequence (i.e. treating “<script type="math/tex">m</script>” as our “<script type="math/tex">\mu</script>”) will be contained
in <script type="math/tex">\tilde{\mathcal{E}}</script>. We moreover claim that the ellipse
<script type="math/tex">\mathcal{E}^{m}</script> defined out of the sequence <script type="math/tex">\sqrt{2} \cdot m_i</script> for all
<script type="math/tex">i</script> contains <script type="math/tex">\tilde{\mathcal{E}}</script>, i.e. <script type="math/tex">\mathcal{E}_m \subset
\tilde{\mathcal{E}} \subset \mathcal{E}^m</script>. If this is true, it then follows
that</p>
<script type="math/tex; mode=display">\mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m)</script>
<p>because the definition of Gaussian complexity requires taking a maximum of
<script type="math/tex">\theta</script> over a set, and if the set grows larger via set containment, then the
Gaussian complexity can only grow larger. In addition, the fact that the upper
and lower bounds are related by a constant <script type="math/tex">\sqrt{2}</script> suggests that there
should be extra lower and upper bounds utilizing universal constants independent
of <script type="math/tex">r</script> and <script type="math/tex">\mu</script>.</p>
<p>Let us prove the two set inclusions previously described, as well as develop the
desired upper and lower bounds. Suppose <script type="math/tex">\{\theta_j\}_{j=1}^\infty \in
\mathcal{E}_m</script>. Then we have</p>
<script type="math/tex; mode=display">\sum_{i=1}^\infty \frac{\theta_i^2}{r^2} \le \sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{r,\mu_i\})^2} \le 1</script>
<p>and</p>
<script type="math/tex; mode=display">\sum_{i=1}^\infty \frac{\theta_i^2}{\mu_i^2} \le \sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{r,\mu_i\})^2} \le 1</script>
<p>In both cases, the first inequality is because we can only decrease the value in
the denominator.<sup id="fnref:downstairs"><a href="#fn:downstairs" class="footnote">2</a></sup> The last inequality follows by assumption of
membership in <script type="math/tex">\mathcal{E}_m</script>. Both requirements for membership in
<script type="math/tex">\tilde{\mathcal{E}}</script> are satisfied, and therefore,
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \mathcal{E}_m</script> implies
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \tilde{\mathcal{E}}</script> and thus the first set
containment. Moving on to the second set containment, suppose
<script type="math/tex">\{\theta_j\}_{j=1}^\infty \in \tilde{\mathcal{E}}</script>. We have</p>
<script type="math/tex; mode=display">\frac{1}{2}\sum_{i=1}^\infty \frac{\theta_i^2}{(\min\{\mu_i,r\})^2}
\;{\overset{(i)}{\le}}\;
\frac{1}{2}\left( \sum_{i=1}^\infty \frac{\theta_i^2}{r^2}+\sum_{i=1}^\infty
\frac{\theta_i^2}{\mu_i^2}\right)
\;{\overset{(ii)}{\le}}\; 1</script>
<p>where (i) follows from a “union bound”-style argument, which to be clear,
happens because for every term <script type="math/tex">i</script> in the summation, we have either
<script type="math/tex">\frac{\theta_i^2}{r^2}</script> or <script type="math/tex">\frac{\theta_i^2}{\mu_i^2}</script> added to the
summation (both positive quantities). Thus, to make the value <em>larger</em>, just add
<em>both</em> terms! Step (ii) follows from the assumption of membership in
<script type="math/tex">\tilde{\mathcal{E}}</script>. Thus, we conclude that <script type="math/tex">\{\theta_j\}_{j=1}^\infty \in
\mathcal{E}^m</script>, and we have proved that</p>
<script type="math/tex; mode=display">\mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m)</script>
<p>The final step of this exercise is to develop a lower bound on the left-hand
side and an upper bound on the right-hand side that are close up to universal
constants. But we have reduced this to an instance of part (a)! Thus, we simply
apply the lower bound for <script type="math/tex">\mathcal{G}(\mathcal{E}_m)</script> and the upper bound for
<script type="math/tex">\mathcal{G}(\mathcal{E}^m)</script> and obtain</p>
<script type="math/tex; mode=display">\sqrt{\frac{2}{\pi}}\left(\sum_{i=1}^\infty m_i^2 \right)^{1/2}
\le \mathcal{G}(\mathcal{E}_m) \le
\mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}^m) \le
\sqrt{2}\left(\sum_{i=1}^\infty m_i^2 \right)^{1/2}</script>
<p>as our final bounds on <script type="math/tex">\mathcal{G}(\tilde{\mathcal{E}})</script>. (Note that
as a sanity check, the constant offset <script type="math/tex">\sqrt{1/\pi} \approx 0.56</script> is less
than one.) This proves part (b).</p>
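<p>The two set containments can also be probed empirically: draw random candidate sequences and check that membership in the smaller set always implies membership in the larger one. This is only an illustration of mine, on a finite truncation with arbitrary mu and r:</p>

```python
import numpy as np

def containments_hold(n_terms=15, r=0.5, trials=2000, seed=0):
    """For mu_j = 1/j and m_i = min(mu_i, r), check that E_m is inside the
    truncated set, which in turn is inside E^m (axes sqrt(2)*m_i)."""
    rng = np.random.default_rng(seed)
    mu = 1.0 / np.arange(1, n_terms + 1)
    m = np.minimum(mu, r)
    for _ in range(trials):
        theta = rng.uniform(-r, r, size=n_terms)
        in_small = (theta**2 / m**2).sum() <= 1          # E_m
        in_mid = (theta**2 / mu**2).sum() <= 1 and (theta**2).sum() <= r**2
        in_big = (theta**2 / (2 * m**2)).sum() <= 1      # E^m
        if (in_small and not in_mid) or (in_mid and not in_big):
            return False
    return True
```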
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Jensen’s Inequality</li>
<li>Union Bound</li>
</ul>
<p><strong>Comments</strong>: This exercise on the surface looks extremely challenging. How does
one reason about multiple infinite sequences, which furthermore may or may not
involve squared terms? I believe the key to tackling these problems is to
understand how to apply Cauchy-Schwarz (or more generally, Holder’s Inequality)
for infinite sequences. More precisely, Holder’s Inequality for sequence
spaces states that</p>
<script type="math/tex; mode=display">\sum_{k=1}^\infty |x_ky_k| \le \left(\sum_{k=1}^\infty x_k^2 \right)^{1/2}\left( \sum_{k=1}^\infty y_k^2 \right)^{1/2}</script>
<p>(It’s actually more general than this, since we can assume arbitrary positive
powers <script type="math/tex">p</script> and <script type="math/tex">q</script> so long as <script type="math/tex">1/p + 1/q=1</script>, but the easiest case to
understand is when <script type="math/tex">p=q=2</script>.)</p>
<p>Holder’s Inequality is <em>enormously helpful</em> when dealing with sums (whether
infinite or not), and <em>especially</em> when dealing with two sums, one of which
squares its terms while the other does not.</p>
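<p>Here is a minimal spot-check of the p = q = 2 case (i.e. Cauchy-Schwarz) on random finite sequences; this snippet is my own illustration:</p>

```python
import numpy as np

def cauchy_schwarz_holds(trials=1000, n=50, seed=0):
    """Check sum |x_k y_k| <= (sum x_k^2)^{1/2} (sum y_k^2)^{1/2}."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        if np.abs(x * y).sum() > np.linalg.norm(x) * np.linalg.norm(y) + 1e-9:
            return False
    return True
```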
<p>Finally, again, think about Jensen’s inequality whenever we have expectations
and a square root!</p>
<h2 id="example6pairwiseincoherence">Example 6: Pairwise Incoherence</h2>
<blockquote>
<p>Given a matrix <script type="math/tex">X \in \mathbb{R}^{n \times d}</script>, suppose it has normalized
columns (<script type="math/tex">\|X_j\|_2/\sqrt{n} = 1</script> for all <script type="math/tex">j = 1,...,d</script>) and pairwise
incoherence upper bounded as <script type="math/tex">% <![CDATA[
\delta_{\rm PW}(X) < \gamma/s %]]></script>.</p>
<p>(a) Let <script type="math/tex">S \subset \{1,2,\ldots,d\}</script> be any subset of size <script type="math/tex">s</script>. Show that
there is a function <script type="math/tex">\gamma \to c(\gamma)</script> such that <script type="math/tex">\lambda_{\rm
min}\left(\frac{X_S^TX_S}{n}\right) \ge c(\gamma) > 0</script> as long as <script type="math/tex">\gamma</script>
is sufficiently small, where <script type="math/tex">X_S</script> is the <script type="math/tex">n\times s</script> matrix formed by
extracting the <script type="math/tex">s</script> columns of <script type="math/tex">X</script> whose indices are in <script type="math/tex">S</script>.</p>
<p>(b) Prove, from first principles, that <script type="math/tex">X</script> satisfies the restricted
nullspace property with respect to <script type="math/tex">S</script> as long as <script type="math/tex">% <![CDATA[
\gamma < 1/3 %]]></script>.</p>
</blockquote>
<p>To clarify, the <em>pairwise incoherence</em> of a matrix <script type="math/tex">X \in \mathbb{R}^{n \times
d}</script> is defined as</p>
<script type="math/tex; mode=display">\delta_{\rm PW}(X) := \max_{j,k = 1,2,\ldots, d}
\left|\frac{\langle X_j, X_k \rangle}{n} - \mathbb{I}[j = k]\right|</script>
<p>where <script type="math/tex">X_i</script> denotes the <script type="math/tex">i</script>th <em>column</em> of <script type="math/tex">X</script>. Intuitively, it measures
the correlation between any columns, though it subtracts an indicator at the end
so that the maximal case does not always correspond to the case when <script type="math/tex">j=k</script>. In
addition, the matrix <script type="math/tex">\frac{X_S^TX_S}{n}</script> as defined in the problem looks like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{X_S^TX_S}{n} =
\begin{bmatrix}
\frac{(X_S)_1^T(X_S)_1}{n} & \frac{(X_S)_1^T(X_S)_2}{n} & \cdots & \frac{(X_S)_1^T(X_S)_s}{n} \\
\frac{(X_S)_1^T(X_S)_2}{n} & \frac{(X_S)_2^T(X_S)_2}{n} & \cdots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
\frac{(X_S)_1^T(X_S)_s}{n} & \cdots & \cdots & \frac{(X_S)_s^T(X_S)_s}{n} \\
\end{bmatrix} =
\begin{bmatrix}
1 & \frac{(X_S)_1^T(X_S)_2}{n} & \cdots & \frac{(X_S)_1^T(X_S)_s}{n} \\
\frac{(X_S)_1^T(X_S)_2}{n} & 1 & \cdots & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
\frac{(X_S)_1^T(X_S)_s}{n} & \cdots & \cdots & 1 \\
\end{bmatrix} %]]></script>
<p>where the 1s in the diagonal are due to the assumption of having normalized columns.</p>
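<p>As a concrete illustration (mine, not from the exercise), the pairwise incoherence is a one-liner on the Gram matrix, and for a random Gaussian design with normalized columns it is typically small:</p>

```python
import numpy as np

def pairwise_incoherence(X):
    """delta_PW(X) = max_{j,k} |<X_j, X_k>/n - 1[j = k]| for an n x d
    matrix X whose columns satisfy ||X_j||_2 = sqrt(n)."""
    n, d = X.shape
    gram = X.T @ X / n
    return np.abs(gram - np.eye(d)).max()

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
X *= np.sqrt(500) / np.linalg.norm(X, axis=0)  # normalize the columns
delta = pairwise_incoherence(X)
```

For n = 500 and d = 20 the off-diagonal correlations are on the order of 1/sqrt(n), so delta_PW comes out far below 1.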
<p>First, we prove part (a). Starting from the <em>variational representation</em> of the
minimum eigenvalue, we consider any possible <script type="math/tex">v \in \mathbb{R}^s</script> with
Euclidean norm one (and thus this analysis will apply for the <em>minimizer</em>
<script type="math/tex">v^*</script> which induces the minimum eigenvalue) and observe that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
v^T\frac{X_S^TX_S}{n}v \;&{\overset{(i)}{=}}\; \sum_{i=1}^sv_i^2 + 2\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j \\
\;&{\overset{(ii)}{=}}\; 1 + 2\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j \\
\;&{\overset{(iii)}{\ge}}\; 1 - 2\frac{\gamma}{s}\sum_{i<j}^s|v_i||v_j| \\
\;&{\overset{(iv)}{=}}\; 1 - \frac{\gamma}{s}\left((|v_1| + \cdots + |v_s|)^2-\sum_{i=1}^sv_i^2\right) \\
\;&{\overset{(v)}{\ge}}\; 1 - \frac{\gamma}{s}\Big(s\|v\|_2^2-\|v\|_2^2\Big)
\end{align} %]]></script>
<p>where (i) follows from the definition of a quadratic form (less formally, by
matrix multiplication), (ii) follows from the <script type="math/tex">\|v\|_2 = 1</script> assumption, (iii)
follows from noting that</p>
<script type="math/tex; mode=display">% <![CDATA[
\left|\sum_{i<j}^s\frac{(X_S)_i^T(X_S)_j}{n}v_iv_j\right| \le \frac{\gamma}{s}\sum_{i<j}^s|v_i||v_j| %]]></script>
<p>which in turn follows from the pairwise incoherence assumption that
<script type="math/tex">\Big|\frac{(X_S)_i^T(X_S)_j}{n}\Big| \le \frac{\gamma}{s}</script>. Step (iv) follows
from definition, and (v) follows from how <script type="math/tex">\|v\|_1 \le \sqrt{s}\|v\|_2</script> for
<script type="math/tex">s</script>-dimensional vectors.</p>
<p>The above applies for any satisfactory <script type="math/tex">v</script>. Putting together the pieces, we
conclude that</p>
<script type="math/tex; mode=display">\lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) = \inf_{\|v\|_2=1} v^T\frac{X_S^TX_S}{n}v
\ge \underbrace{1 - \gamma \frac{s-1}{s}}_{c(\gamma)} \ge 1-\gamma,</script>
<p>which is strictly positive as long as <script type="math/tex">\gamma</script> is sufficiently small (any <script type="math/tex">% <![CDATA[
\gamma < 1 %]]></script> suffices).</p>
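<p>This eigenvalue bound can be sanity-checked numerically: compute delta_PW for a random normalized design, set gamma = s * delta_PW so that the assumption |(X_S)_i^T (X_S)_j| / n <= gamma/s holds by construction, and compare the minimum eigenvalue against 1 - gamma. A sketch with my own choices of n, d, and s:</p>

```python
import numpy as np

def lambda_min_vs_bound(n=500, d=30, s=5, seed=1):
    """Check lambda_min(X_S^T X_S / n) >= 1 - gamma with gamma = s*delta_PW,
    which makes the incoherence assumption hold by construction."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)    # normalized columns
    gram = X.T @ X / n
    gamma = s * np.abs(gram - np.eye(d)).max()     # gamma/s = delta_PW
    lam_min = np.linalg.eigvalsh(gram[:s, :s])[0]  # take S = first s indices
    return lam_min, 1 - gamma
```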
<p>To prove the restricted <a href="https://en.wikipedia.org/wiki/Nullspace_property">nullspace property</a> in (b), we first suppose that
<script type="math/tex">\theta \in \mathbb{R}^d</script> and <script type="math/tex">\theta \in {\rm null}(X) \setminus \{0\}</script>.
Define <script type="math/tex">d</script>dimensional vectors <script type="math/tex">\tilde{\theta}_S</script> and
<script type="math/tex">\tilde{\theta}_{S^c}</script> which match components of <script type="math/tex">\theta</script> for the indices
within their respective sets <script type="math/tex">S</script> or <script type="math/tex">S^c</script>, and which are zero
otherwise.<sup id="fnref:time"><a href="#fn:time" class="footnote">3</a></sup> Supposing that <script type="math/tex">S</script> corresponds to the subset of indices of
<script type="math/tex">\theta</script> of the <script type="math/tex">s</script> largest elements in absolute value, it suffices to show
that <script type="math/tex">\|\tilde{\theta}_{S^c}\|_1 > \|\tilde{\theta}_S\|_1</script>, because then we
can <em>never</em> violate this inequality (and thus the restricted nullspace property
holds).</p>
<p>We first show a few facts which we then piece together to get the final result.
The first is that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
0 \;&{\overset{(i)}{=}}\; \|X\theta \|_2^2 \\
\;&{\overset{(ii)}{=}}\; \|X\tilde{\theta}_S + X\tilde{\theta}_{S^c}\|_2^2 \\
\;&{\overset{(iii)}{=}}\; \|X\tilde{\theta}_S\|_2^2 + \|X\tilde{\theta}_{S^c}\|_2^2 + 2\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\\
\;&{\overset{(iv)}{\ge}}\; n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) - 2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|
\end{align} %]]></script>
<p>where (i) follows from the assumption that <script type="math/tex">\theta</script> is in the kernel of <script type="math/tex">X</script>,
(ii) follows from how <script type="math/tex">\theta = \tilde{\theta}_S + \tilde{\theta}_{S^c}</script>,
(iii) follows from expanding the term, and (iv) follows from carefully noting
that</p>
<script type="math/tex; mode=display">\lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) = \min_{v \in \mathbb{R}^s \setminus \{0\}}
\frac{v^T\frac{X_S^TX_S}{n}v}{v^Tv} \le
\frac{\theta_S^T\frac{X_S^TX_S}{n}\theta_S}{\|\theta_S\|_2^2}</script>
<p>where in the inequality, we have simply chosen <script type="math/tex">\theta_S</script> as our <script type="math/tex">v</script>, which
can only make the bound worse. Then step (iv) follows immediately. Don’t forget
that <script type="math/tex">\|\theta_S\|_2^2 = \|\tilde{\theta}_S\|_2^2</script>, because the latter
involves a vector that (while longer) only has extra zeros. Incidentally, the
above uses the variational representation for eigenvalues in a way that’s more
convenient if we don’t want to restrict our vectors to have Euclidean norm one.</p>
<p>We conclude from the above that</p>
<script type="math/tex; mode=display">n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) \le 2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|</script>
<p>Next, let us upper bound the RHS. We see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big|\;&{\overset{(i)}{=}}\; \Big|\theta_S^T(X_S^TX_{S^c})\theta_{S^c}\Big|\\
\;&{\overset{(ii)}{=}}\; \left| \sum_{i\in S, j\in S^c} X_i^TX_j (\tilde{\theta}_S)_i(\tilde{\theta}_{S^c})_j \right| \\
\;&{\overset{(iii)}{\le}}\; \frac{n\gamma}{s} \sum_{i\in S, j\in S^c} |(\tilde{\theta}_S)_i||(\tilde{\theta}_{S^c})_j| \\
\;&{\overset{(iv)}{=}}\; \frac{n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1
\end{align} %]]></script>
<p>where (i) follows from a little thought about how matrix multiplication and
quadratic forms work. In particular, if we expanded out the LHS, we would get a
sum with lots of terms that are zero since <script type="math/tex">(\tilde{\theta}_S)_i</script> or
<script type="math/tex">(\tilde{\theta}_{S^c})_j</script> would cancel them out. (To be clear, <script type="math/tex">\theta_S \in
\mathbb{R}^s</script> and <script type="math/tex">\theta_{S^c} \in \mathbb{R}^{d-s}</script>.) Step (ii) follows
from definition, step (iii) follows from the provided Pairwise Incoherence bound
(note the need to multiply by <script type="math/tex">n/n</script>), and step (iv) follows from how</p>
<script type="math/tex; mode=display">\|\theta_S\|_1\|\theta_{S^c}\|_1 = \Big(|(\theta_S)_1| +\cdots+ |(\theta_S)_s|\Big)
\Big(|(\theta_{S^c})_1| +\cdots+ |(\theta_{S^c})_{d-s}|\Big)</script>
<p>and thus it is clear that the product of the <script type="math/tex">L_1</script> norms consists of the sum
of all possible combinations of indices with nonzero values.</p>
<p>The last thing we note is that from part (a), if we assumed that <script type="math/tex">\gamma \le
1/3</script>, then a lower bound on <script type="math/tex">\lambda_{\rm min}
\left(\frac{X_S^TX_S}{n}\right)</script> is <script type="math/tex">2/3</script>. Putting the pieces together, we
get the following three inequalities</p>
<script type="math/tex; mode=display">\frac{2n\|\theta_S\|_2^2}{3} \;\;\le \;\;
n\|\theta_S\|_2^2 \cdot \lambda_{\rm min}\left(\frac{X_S^TX_S}{n}\right) \;\;\le \;\;
2\Big|\tilde{\theta}_S^T(X^TX)\tilde{\theta}_{S^c}\Big| \;\; \le \;\;
\frac{2n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1</script>
<p>We can provide a lower bound for the first term above. Using the fact that
<script type="math/tex">\|\theta_S\|_1^2 \le s\|\theta_S\|_2^2</script>, we get
<script type="math/tex">\frac{2n\|\theta_S\|_1^2}{3s} \le \frac{2n\|\theta_S\|_2^2}{3}</script>. The final
step is to tie the lower bound here with the upper bound from the set of three
inequalities above. This results in</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{2n\|\theta_S\|_1^2}{3s} \le \frac{2n\gamma}{s}\|\theta_S\|_1\|\theta_{S^c}\|_1 \quad &\iff \quad
\frac{\|\theta_S\|_1^2}{3} \le \gamma \|\theta_S\|_1\|\theta_{S^c}\|_1 \\
&\iff \quad \|\theta_S\|_1 \le 3\gamma \|\theta_{S^c}\|_1
\end{align} %]]></script>
<p>Under the same assumption from earlier (that <script type="math/tex">% <![CDATA[
\gamma < 1/3 %]]></script>) it follows directly
that <script type="math/tex">% <![CDATA[
\|\theta_S\|_1 < \|\theta_{S^c}\|_1 %]]></script>, as claimed. Whew!</p>
<p>Tricks used:</p>
<ul>
<li>Cauchy-Schwarz</li>
<li>Norm Properties</li>
<li>Variational Representation (of eigenvalues)</li>
</ul>
<p><strong>Comments</strong>: Actually, for part (a), one can prove this more directly by using
the <a href="https://en.wikipedia.org/wiki/Gershgorin_circle_theorem">Gershgorin Circle Theorem</a>, a <em>very</em> useful theorem with a surprisingly
simple proof. But I chose the approach above so that we can make use of the
variational representation for eigenvalues. There are also variational
representations for <em>singular values</em>.</p>
<p>The above uses a <em>lot</em> of norm properties. One example was the use of <script type="math/tex">\|v\|_1
\le \sqrt{s}\|v\|_2</script>, which can be proved via Cauchy-Schwarz. The extension to
this is that <script type="math/tex">\|v\|_2 \le \sqrt{s}\|v\|_\infty</script>. These are quite handy.
Another example, which is useful when dealing with specific subsets, is to
understand how the <script type="math/tex">L_1</script> and <script type="math/tex">L_2</script> norms behave. Admittedly, getting all the
steps right for part (b) takes a <em>lot</em> of hassle and attention to details, but
it is certainly satisfying to see it work.</p>
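<p>Both norm comparisons are easy to spot-check on random s-dimensional vectors; this is my own illustration:</p>

```python
import numpy as np

def norm_chain_holds(trials=1000, s=12, seed=0):
    """Check ||v||_1 <= sqrt(s)||v||_2 and ||v||_2 <= sqrt(s)||v||_inf."""
    rng = np.random.default_rng(seed)
    root_s = np.sqrt(s)
    for _ in range(trials):
        v = rng.standard_normal(s)
        l1, l2, linf = np.abs(v).sum(), np.linalg.norm(v), np.abs(v).max()
        if l1 > root_s * l2 + 1e-9 or l2 > root_s * linf + 1e-9:
            return False
    return True
```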
<h2 id="closingthoughts">Closing Thoughts</h2>
<p>I hope this post serves as a useful reference for me and to anyone else who
might need to use one of these tricks to understand some machine learning and
statisticsrelated math.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:miller">
<p>One of my undergraduate mathematics professors, <a href="https://web.williams.edu/Mathematics/sjmiller/public_html/williams/welcome.html">Steven J.
Miller</a>, would love this trick, as his two favorite tricks in mathematics
are <em>adding zero</em> and, of course, multiplying by one. <a href="#fnref:miller" class="reversefootnote">↩</a></p>
</li>
<li id="fn:downstairs">
<p>Or “downstairs” as professor <a href="https://people.eecs.berkeley.edu/~jordan/">Michael I. Jordan</a> often puts it
(and obviously, “upstairs” for the numerator). <a href="#fnref:downstairs" class="reversefootnote">↩</a></p>
</li>
<li id="fn:time">
<p>It can take some time and effort to visualize and process all this
information. I find it helpful to draw some of these out with pencil and
paper, and also to assume without loss of generality that <script type="math/tex">S</script> corresponds
to the first “block” of <script type="math/tex">\theta</script>, and <script type="math/tex">S^c</script> therefore corresponds to the
second (and last) “block.” Please contact me if you spot typos; they’re
really easy to make here. <a href="#fnref:time" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 06 May 2017 03:00:00 -0700
https://danieltakeshi.github.io/2017/05/06/mathematical-tricks-commonly-used-in-machine-learning-and-statistics