My Blog Posts, in Reverse Chronological Order
Deep Reinforcement Learning (CS 294-112) at Berkeley, Take Two
Back in Fall 2015, I took the first edition of Deep Reinforcement Learning (CS 294-112) at Berkeley. As usual, I wrote a blog post about the class; you can find more about other classes I’ve taken by searching the archives.
In that blog post, I admitted that CS 294-112 had several weaknesses, and also that I didn’t quite fully understand the material. Fast forward to today, and I’m pleased to say that:

There has been a second edition of CS 294-112, taught this past spring semester. It was a three-credit, full-semester course and therefore more substantive than the previous edition, which was two credits and lasted only eight weeks. Furthermore, the slides, homework assignments, and lecture recordings are all publicly available online. Check out the course website for details. You can find the homework assignments in this GitHub repository (I had to search a bit for this).

I now understand much more about deep reinforcement learning and about how to use TensorFlow.
These developments go hand in hand, because I spent much of the second half of the Spring 2017 semester self-studying the second edition of CS 294-112. (To be clear, I was not enrolled in the class.) I know I said I would first self-study a few other courses in a previous blog post, but I couldn’t pass up such a prime opportunity to learn about deep reinforcement learning. Furthermore, the field moves so fast that I worried that if I didn’t follow what was happening now, I would never be able to catch up to the research frontier if I tried to do so in a year.
The class had four homework assignments, and I completed all of them except for the DAgger algorithm implementation in the first homework, which I skipped. The assignments were extremely helpful for me to understand how to better use TensorFlow, and I finally feel comfortable using it for my personal projects. If I can spare the time (famous last words) I plan to write some TensorFlow-related blog posts.
The video lectures were a nice bonus. I only watched a fraction of them, though. This was in part due to time constraints, but also in part due to the lack of captions. The lecture recordings are on YouTube, and in YouTube, I can turn on automatic captions, which help me follow the material. However, some of the videos didn’t enable that option, so I had to skip those and just read the slides since I wasn’t following what was being said. As far as I remember, automatic captions are provided as an option so long as whoever uploaded the video enables some setting, so maybe someone forgot to do so? Fortunately, the lecture video on policy gradients has captions enabled, so I was able to watch that one. Oh, and I wrote a blog post about the material.
Another possible downside to the course, though this one is extremely minor, is that the last few class sessions were not recorded, since those were when students presented their final projects. Maybe the students wanted some level of privacy? Oh well, I suppose there are way too many other interesting projects available anyway (by searching GitHub, arXiv preprints, etc.) to worry about this.
I want to conclude with a huge thank you to the course staff. Thank you for helping to spread knowledge about deep reinforcement learning with a great class and with lots of publicly available material. I really appreciate it.
Alan Turing: The Enigma
I finished reading Andrew Hodges’ book Alan Turing: The Enigma, otherwise known as the definitive biography of mathematician, computer scientist, and code breaker Alan Turing. I was inspired to read the book in part because I’ve been reading lots of AI-related books this year^{1} and in just about every one of those books, Alan Turing is mentioned in some form. In addition, I saw the film The Imitation Game, and indeed this is the book that inspired it. I bought the 2014 edition of the book (with The Imitation Game cover) during a recent visit to the National Cryptologic Museum.
The author is Andrew Hodges, who at that time was a mathematics instructor at the University of Oxford (he’s now retired). He maintains a website where he commemorates Alan Turing’s life and achievements. I encourage the interested reader to check it out. Hodges has the qualifications to write the book, being deeply versed in mathematics. He also appears to be gay himself.^{2}
After reading the book, my immediate thoughts relating to the positive aspects of the book are:

The book is organized chronologically and the eight chapters are indicated with date ranges. Thus, for a biography of this size, it is relatively straightforward to piece together a mental timeline of Alan Turing’s life.

The book is detailed. Like, wow. The edition I have is 680 pages, not counting the endnotes at the back of the book which command an extra 30 or so pages. Since I read almost every word of this book (I skipped a few endnotes), and because I tried to stay alert when reading this book, I felt like I got a clear picture of Turing’s life, along with what life must have been like during the World War II era.

The book contains quotes and writings from Turing that show just how far ahead of his time he was. For instance, even today people are still utilizing concepts from his famous 1936 paper On Computable Numbers, with an Application to the Entscheidungsproblem and his 1950 paper Computing Machinery and Intelligence. The former introduced Turing Machines, the latter introduced the famous Turing Test. Fortunately, I don’t think there was much exaggeration of Turing’s accomplishments, unlike in The Imitation Game. When I was reading his quotes, I often had to remind myself that “this is the 1940s or 1950s ….”

The book showcases the struggles of being gay, particularly during a time when homosexual activity was a crime. The book actually doesn’t seem to cover some of his struggles in the early 1950s as much as I thought it would, but it was probably difficult to find sufficient references for this aspect of his life. At the very least, readers today should appreciate how much our attitude towards homosexuality has improved.
That’s not to say there weren’t a few downsides. Here are some I thought of:

Related to what I mentioned earlier, it is long. It took me a month to finish, and the writing is in “1983-style” which makes it more difficult for me to understand. (By contrast, I read both of Richard Dawkins’ recent autobiographies, which combine to be roughly the same length as Hodges’ book, and Dawkins’ books were much easier to read.) Now, I find Turing’s life very interesting so this is more of a “neutral” factor to me, but I can see why the casual reader might be dissuaded from reading this book.

Much of the material is technical even to me. I understand the basics of Turing Machines but certainly not how the early computers were built. The hardest parts of the book to read are probably in chapters six and seven (out of eight total). I kept asking myself, “what’s a cathode ray?”
To conclude, the book is an extremely detailed overview of Turing’s life which at times may be technically challenging to read.
I wonder what Alan Turing would think about AI today. The widely-used AI undergraduate textbook by Stuart Russell and Peter Norvig concludes with the following prescient quote by Turing:
We can only see a short distance ahead, but we can see plenty there that needs to be done.
Earlier scientists have an advantage in setting their legacy in their fields since it’s easier to make landmark contributions. I view Charles Darwin, for instance, as the greatest biologist who has ever lived, and no matter how skilled today’s biologists are, I believe none will ever be able to surpass Darwin’s impact. The same goes today for Alan Turing, who (possibly along with John von Neumann) is one of the two preeminent computer scientists who have ever lived.
Despite all the talent that’s out there in computer science, I don’t think any one individual can possibly surpass Turing’s legacy on computer science and artificial intelligence.

Thus, the 2017 edition of my reading list post (here’s the 2016 version, if you’re wondering) is going to be very biased in terms of AI. Stay tuned! ↩

I only say this because people who are members of “certain groups” — where membership criteria is not due to choice but due to intrinsic human characteristics — tend to have more knowledge about the group than “outsiders.” Thus, a gay person by default has extra credibility when writing about being gay than would a straight person. A deaf person by default has extra credibility when writing about deafness than a hearing person. And so on. ↩
Understanding Deep Learning Requires Rethinking Generalization: My Thoughts and Notes
The paper “Understanding Deep Learning Requires Rethinking Generalization” (arXiv link) caused quite a stir in the Deep Learning and Machine Learning research communities. It’s the rare paper that seems to have high research merit (judging from being awarded one of three Best Paper awards at ICLR 2017) but is also readable. Hence, it got the most comments of any ICLR 2017 submission on OpenReview. It has also been discussed on reddit and was recently featured on The Morning Paper blog. I was aware of the paper shortly after it was uploaded to arXiv, but never found the time to read it in detail until now.
I enjoyed reading the paper, and while I agree with many readers that some of the findings might be obvious, the paper nonetheless seems deserving of the attention it has been getting.
The authors conveniently put two of their important findings in centered italics:
Deep neural networks easily fit random labels.
and
Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.
I will also quote another contribution from the paper that I find interesting:
We complement our empirical observations with a theoretical construction showing that generically large neural networks can express any labeling of the training data.
(I go through the derivation later in this post.)
Going back to their first claim about deep neural networks fitting random labels, what does this mean from a generalization perspective? (Generalization error is just the difference between training error and testing error.) It means that we cannot come up with a “generalization function” that can take in a neural network as input and output a generalization quality score. Here’s my intuition:

What we want: let’s imagine an arbitrary encoding of a neural network designed to give as much deterministic information as possible, such as the architecture and hyperparameters, and then use that encoding as input to a generalization function. We want that function to give us a number representing generalization quality, assuming that the datasets are allowed to vary. The worst generalization occurs when a fixed neural network gets excellent training error but could get either the same testing error (awesome!), or get testset performance no better than random guessing (ugh!).

Reality: unfortunately, the best we can do seems to be no better than the worst case. We know of no function that can provide bounds on generalization performance across all datasets. Why? Let’s use the LeNet architecture and MNIST as an example. With the right architecture, generalization error is very small as both training and testing accuracy are in the high 90s (percent). With a second data set that consists of the same MNIST digits, but with the labels randomized, that same LeNet architecture can do no better than random guessing on the test set, even though the training performance is extremely good (or at least, it should be). That’s literally as bad as we can get. There’s no point in developing a function to measure generalization when we know it can only tell us that generalization will be in between zero (i.e. perfect) and the difference between zero and random guessing (i.e. the worst case)!
As they later discuss in the paper, regularization can be used to improve generalization, but will not be sufficient for developing our desired generalization criteria.
Let’s briefly take a step back and consider classical machine learning, which provides us with generalization criteria such as VC-dimension, Rademacher complexity, and uniform stability. I learned about VC-dimension during my undergraduate machine learning class, Rademacher complexity during STAT 210B this past semester, and … actually I’m not familiar with uniform stability. But intuitively … it makes sense to me that classical criteria do not apply to deep networks. To take the Rademacher complexity example: a function class which can fit to arbitrary noise vectors presents the trivial bound of one, which is like saying: “generalization is between zero and the worst case.” Not very helpful.
The paper then proceeds to describe their testing scenario, and packs some important results in the figure reproduced below:
This figure represents a neural network classifying the images in the widely-benchmarked CIFAR-10 dataset. The network the authors used is a simplified version of the Inception architecture.

The first subplot represents five different settings of the labels and input images. To be clear on what the “gaussian” setting means, they use a Gaussian distribution to generate random pixels (!!) for every image. The mean and variance of that Gaussian are “matched to the original dataset.” In addition, the “shuffled” and “random” pixels apply a random permutation to the pixels, with the same permutation to all images for the former, and different permutations for the latter.
We immediately see that the neural network can get zero training error on all the settings, but the convergence speed varies. Intuition suggests that the dataset with the correct labels and the one with the same shuffling permutation should converge quickly, and this indeed is the case. Interestingly enough, I thought the “gaussian” setting would have the worst performance, but that prize seems to go to “random labels.”

The second subplot measures training error when the amount of label noise is varied; with some probability $p$, each image independently has its label corrupted and replaced with a draw from the discrete uniform distribution over the classes. The results show that more corruption slows convergence, which makes sense. By the way, using a continuum of something is a common research tactic and something I should try for my own work.

Finally, the third subplot measures generalization error under label corruption. As these data points were all measured after convergence, this is equivalent to the test error. The results here also make a lot of sense. Test set error should be approaching 90 percent because CIFAR-10 has 10 classes (that’s why it’s called CIFAR-10!).
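The label-corruption scheme described for the second subplot is simple to sketch. The snippet below is my own minimal NumPy illustration, not the authors’ code: the function name `corrupt_labels`, the symbol `p` for the corruption probability, and the synthetic stand-in labels are all assumptions.

```python
import numpy as np

def corrupt_labels(labels, p, num_classes, rng):
    """With probability p (independently per example), replace the label
    with a draw from the uniform distribution over the classes."""
    labels = np.asarray(labels)
    corrupt = rng.random(labels.shape[0]) < p
    random_labels = rng.integers(0, num_classes, size=labels.shape[0])
    return np.where(corrupt, random_labels, labels)

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=50_000)   # stand-in for 10-class training labels
for p in (0.0, 0.5, 1.0):
    noisy = corrupt_labels(clean, p, num_classes=10, rng=rng)
    print(p, np.mean(noisy != clean))      # fraction of labels that actually changed
```

Note that a “corrupted” label can coincide with the original one, so with 10 classes the fraction of labels that actually change is roughly 0.9 times the corruption probability. This matches the intuition that test error under full corruption should approach 90 percent.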
My major criticism of this figure is not that the results, particularly in the second and third subplots, might seem obvious but that the figure lacks error bars. Since it’s easy nowadays to program multiple calls in a bash script or something similar, I would expect at least three trials and with error bars (or “regions”) to each curve in this figure.
The next section discusses the role of regularization, which is normally applied to prevent overfitting to the training data. The classic example is with linear regression and a dataset of several points arranged in roughly a linear fashion. Do we try to fit a straight line through these points, which might have lots of training error, or do we take a high-dimensional polynomial and fit every point exactly, even if the resulting curve looks impossibly crazy? That’s what regularization helps to control. Explicit regularization in linear regression is the $\lambda$ term in the following optimization problem:
$$\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$
I presented this in an earlier blog post.
To investigate the role of regularization in Deep Learning, the authors test with and without regularizers. Incidentally, the use of $\lambda$ above is not the only type of regularization. There are also several others: data augmentation, dropout, weight decay, early stopping (implicit) and batch normalization (implicit). These are standard tools in the modern Deep Learning toolkit.
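To make explicit regularization in linear regression concrete, here is a small sketch using the standard ridge objective $\|Xw - y\|_2^2 + \lambda\|w\|_2^2$. I am assuming that form here; the function name `ridge_fit` and the synthetic data are mine, purely for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||_2^2 + lam * ||w||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

w_unreg = ridge_fit(X, y, lam=0.0)    # ordinary least squares
w_reg = ridge_fit(X, y, lam=10.0)     # weights shrunk towards zero
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

Increasing `lam` trades training error for a smaller weight norm, which is the knob that explicit regularization gives us.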
They find that, while regularization helps to improve generalization performance, it is still possible to get excellent generalization even with no regularization. They conclude:
In summary, our observations on both explicit and implicit regularizers are consistently suggesting that regularizers, when properly tuned, could help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers [are] removed.
On a side note, the regularization discussion in the paper feels out of order and the writing sounds a bit off to me. I wish they had more time to fix this, as the regularization portion of the paper contains most of my English language-related criticism.
Moving on, the next section of the paper is about finite-sample expressivity, or understanding what functions neural networks can express given a finite number of samples. The authors state that the previous literature focuses on population analysis where one can assume an arbitrary number of samples. Here, instead, they assume a fixed set of training points $\{x_1, \ldots, x_n\}$. This seems easier to understand anyway.
They prove a theorem that relates to the third major contribution I wrote earlier: “that generically large neural networks can express any labeling of the training data.” Before proving the theorem, let’s begin with the following lemma:
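Lemma 1. For any two interleaving sequences of $n$ real numbers
$$b_1 < x_1 < b_2 < x_2 < \cdots < b_n < x_n,$$
the $n \times n$ matrix $A = [\max\{x_i - b_j, 0\}]_{ij}$ has full rank. Its smallest eigenvalue is $\min_i (x_i - b_i)$.
Whenever I see statements like these, my first instinct is to draw out the matrix. And here it is:
$$A = \begin{bmatrix} \max\{x_1 - b_1, 0\} & \cdots & \max\{x_1 - b_n, 0\} \\ \vdots & \ddots & \vdots \\ \max\{x_n - b_1, 0\} & \cdots & \max\{x_n - b_n, 0\} \end{bmatrix} \overset{(i)}{=} \begin{bmatrix} x_1 - b_1 & 0 & \cdots & 0 \\ x_2 - b_1 & x_2 - b_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ x_n - b_1 & x_n - b_2 & \cdots & x_n - b_n \end{bmatrix}$$
where (i) follows from the interleaving sequence assumption. This matrix is lower-triangular, and moreover, all the nonzero elements are positive. We know from linear algebra that lower triangular matrices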
 are invertible if and only if the diagonal elements are nonzero
 have their eigenvalues taken directly from the diagonal elements
These two facts together prove Lemma 1. Next, we can prove:
Theorem 1. There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.
Consider the function
$$c(x) = \sum_{j=1}^{n} w_j \max\{a^T x - b_j, 0\}$$
with $w, b \in \mathbb{R}^n$ and $a, x \in \mathbb{R}^d$. (There’s a typo in the paper, $c$ is a function from $\mathbb{R}^d \to \mathbb{R}$, not $\mathbb{R}^n \to \mathbb{R}$). This can certainly be represented by a depth-2 ReLU network. To be clear on the naming convention, “depth-2” does not count the input layer, so our network should only have one ReLU layer in it as the output shouldn’t have ReLUs applied to it.
Here’s how to think of the network representing $c$. First, assume that we have a minibatch of $n$ elements, so that $X$ is the $n \times d$ data matrix. The depth-2 network representing $c$ can be expressed as:
$$c(X) = \max\Big( X \begin{bmatrix} a & \cdots & a \end{bmatrix} - \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix}, \; \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix} \Big) \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$$
where $b$ and the zero-vector used in the maximum “broadcast” as necessary in Python code.
Given a fixed dataset $S = \{z_1, \ldots, z_n\}$ of distinct inputs with labels $y_1, \ldots, y_n$, we must be able to find settings of $a$, $b$, and $w$ such that $c(z_i) = y_i$ for all $i$. You might be guessing how we’re doing this: we must reduce this to the interleaving property in Lemma 1. Because the $z_i$ are distinct, it is possible to find $a$ (together with a suitable $b$) to make the $x_i = a^T z_i$ terms satisfy the interleaving property. Then we have a full rank solution, hence $y = Aw$ results in $w^* = A^{-1} y$ as our final weights, where $A$ is precisely that matrix from Lemma 1! We also see that, indeed, there are $2n + d$ weights in the network. This is an interesting and fun proof, and I think variants of this question would work well as a homework assignment for a Deep Learning class.
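The construction in the proof can be checked numerically. The following is my own sketch under stated assumptions, not code from the paper: I pick the projection direction `a` at random (so the projections are distinct almost surely), choose `b` as midpoints between consecutive sorted projections, and then solve the lower-triangular system from Lemma 1. The variable names and the synthetic data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
Z = rng.normal(size=(n, d))        # n distinct inputs in d dimensions
y = rng.normal(size=n)             # arbitrary real-valued labels

# Project onto a random direction; the projections are distinct almost surely.
a = rng.normal(size=d)
order = np.argsort(Z @ a)
Z, y = Z[order], y[order]          # reorder so that x_1 < x_2 < ... < x_n
x = Z @ a

# Interleave: b_1 < x_1 < b_2 < x_2 < ... < b_n < x_n
b = np.empty(n)
b[0] = x[0] - 1.0
b[1:] = (x[:-1] + x[1:]) / 2.0

A = np.maximum(x[:, None] - b[None, :], 0.0)   # A_ij = max(x_i - b_j, 0)
w = np.linalg.solve(A, y)                      # w = A^{-1} y

def c(inputs):
    """Depth-2 ReLU network: sum_j w_j * max(a^T z - b_j, 0) for each row z."""
    return np.maximum((inputs @ a)[:, None] - b[None, :], 0.0) @ w

print(np.allclose(A, np.tril(A)))              # is A lower triangular?
print(np.allclose(c(Z), y))                    # does the network fit every label?
print(a.size + b.size + w.size == 2 * n + d)   # weight count
```

The broadcasting in `np.maximum(...)` plays the role of the “broadcast” step in the matrix expression: `b` is subtracted from every row, and the scalar zero stands in for the zero-vector.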
The authors conclude the paper by trying to understand generalization with linear models, in the hope that some of the intuition will transfer over to the Deep Learning setting. With linear models, given some weights $w$ resulting from the optimization problem, what can we say about generalization just by looking at it? Curvature is one popular metric to understand the quality of the minima (which is not necessarily the same as the generalization criteria!), but the Hessian is independent of $w$, so in fact it seems impossible to use curvature for generalization. I’m convinced this is true for the normal mean square loss, but is this still true if the loss function were, say, the cube of the difference? After all, there are only two derivatives applied on $w$, right?
The authors instead urge us to think of stochastic gradient descent instead of curvature when trying to measure quality. Assuming that $w_0 = 0$, the stochastic gradient descent update consists of a series of “linear combination” updates, and hence the result is just a linear combination of linear combinations of linear combinations … (and so forth) … which at the end of the day, remains a linear combination. (I don’t think they need to assume $w_0 = 0$ if we can add an extra 1 to all the data points.) Consequently, they can fit any set of labels of the data by solving a linear equation, and indeed, they get strong performance on MNIST and CIFAR-10, even without regularization.
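Here is a small sketch of that “linear combination” argument. It assumes squared loss, a zero initialization, and more dimensions than samples; the step size, the iteration count, and the synthetic data are arbitrary choices of mine, not anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 200                      # fewer samples than dimensions
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)   # arbitrary labels

# Plain SGD on squared loss from w = 0: every update adds a multiple of one x_i.
w = np.zeros(d)
alpha_sgd = np.zeros(n)             # running coefficients, so that w == X.T @ alpha_sgd
lr = 1e-3
for step in range(2000):
    i = rng.integers(n)
    coeff = -lr * 2.0 * (X[i] @ w - y[i])   # gradient of (x_i^T w - y_i)^2 is 2 * residual * x_i
    w += coeff * X[i]
    alpha_sgd[i] += coeff
print(np.allclose(w, X.T @ alpha_sgd))      # w stays in the span of the data points

# Solving the n x n system (X X^T) alpha = y gives an interpolating w in that same span.
alpha = np.linalg.solve(X @ X.T, y)
w_star = X.T @ alpha
print(np.allclose(X @ w_star, y))
```

The second half is the “solving a linear equation” step: once we know the solution has the form $X^T\alpha$, fitting the labels reduces to an $n \times n$ linear system.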
They next try to relate this to a minimum norm interpretation, though this is not a fruitful direction because their results are worse when they try to find minimum norm solutions. On MNIST, their best solution using some “Gabor wavelet transform” (what?), is twice as good as the minimum norm solution. I’m not sure how much stock to put into this section, other than how I like their perspective of thinking of SGD as an implicit regularizer (like batch normalization) rather than an optimizer. The line between the categories is blurring.
To conclude, from my growing experience with Deep Learning, I don’t find their experimental results surprising. That’s not to say the paper was entirely predictable, but think of it this way: if I were a computer vision researcher pre-AlexNet, I would be more surprised at reading the AlexNet paper than I am today reading this paper. Ultimately, as I mentioned earlier, I enjoyed this paper, and while it was predictable (that word again…) that it couldn’t offer any solutions, perhaps it will be useful as a starting point to understanding generalization in Deep Learning.
Mathematical Tricks Commonly Used in Machine Learning and Statistics
I have passionately studied various machine learning and statistical concepts over the last few years. One thing I’ve learned from all this is that there are many mathematical “tricks” involved, whether or not they are explicitly stated. (In research papers, such tricks are often used without acknowledgment since it is assumed that anyone who can benefit from reading the paper has the mathematical maturity to fill in the details.) I thought it would be useful for me, and hopefully for a few interested readers, to catalogue a set of the common tricks here, and to see them applied in a few examples.
The following list, in alphabetical order, is a non-exhaustive set of tricks that I’ve seen:
 Cauchy-Schwarz
 Integrating Probabilities into Expectations
 Introducing an Independent Copy
 Jensen’s Inequality
 Law of Iterated Expectation
 Lipschitz Functions
 Markov’s Inequality
 Norm Properties
 Series Expansions (e.g. Taylor’s)
 Stirling’s Approximation
 Symmetrization
 Take a Derivative
 Union Bound
 Variational Representations
If the names are unclear or vague, the examples below should clarify. All the tricks are used except for the law of iterated expectation, i.e. $\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$. (No particular reason for that omission; it just turns out the exercises I’m interested in didn’t require it.)
Example 1: Maximum of (Not Necessarily Independent!) Sub-Gaussians
I covered this problem in my last post here so I will not repeat the details. However, there are two extensions to that exercise which I thought would be worth noting.
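First, to prove an upper bound for the random variable $Z = \max_{i=1,\ldots,n} |X_i|$, it suffices to proceed as we did earlier in the non-absolute value case, but augment our sub-Gaussian variables $X_1, \ldots, X_n$ with the set $-X_1, \ldots, -X_n$. It’s OK to do this because no independence assumptions are needed. Then it turns out that an upper bound can be derived as
$$\mathbb{E}[Z] \le 2\sqrt{\sigma^2 \log n}$$
This is the same as what we had earlier, except the “2” is now outside the square root. It’s quite intuitive.
Second, consider how we can prove the following bound:
$$\mathbb{P}\Big[Z \ge 2\sqrt{\sigma^2 \log n} + \delta\Big] \le 2 e^{-\frac{\delta^2}{2\sigma^2}}$$
We start by applying the standard technique of multiplying by $\lambda > 0$, exponentiating and then applying Markov’s Inequality with our non-negative random variable $e^{\lambda Z}$:
$$\begin{aligned} \mathbb{P}\Big[Z \ge 2\sqrt{\sigma^2 \log n} + \delta\Big] &= \mathbb{P}\Big[e^{\lambda Z} \ge e^{\lambda(2\sqrt{\sigma^2 \log n} + \delta)}\Big] \\ &\le \mathbb{E}[e^{\lambda Z}] \, e^{-\lambda(2\sqrt{\sigma^2 \log n} + \delta)} \\ &\overset{(i)}{\le} 2n \, e^{\frac{\lambda^2\sigma^2}{2} - \lambda(2\sqrt{\sigma^2 \log n} + \delta)} \\ &\overset{(ii)}{\le} 2n \, e^{-\frac{(2\sqrt{\sigma^2 \log n} + \delta)^2}{2\sigma^2}} \end{aligned}$$
where in (i) we used a bound previously determined in our bound on $\mathbb{E}[Z]$ (it came out of an intermediate step), and then used the fact that the term in the exponential is a convex quadratic to find the minimizer value $\lambda^* = (2\sqrt{\sigma^2 \log n} + \delta)/\sigma^2$ via differentiation in (ii).
At this point, to satisfy the desired inequality, we compare terms in the exponentials and claim that with $\delta \ge 0$,
$$(2\sqrt{\sigma^2 \log n} + \delta)^2 \ge 2\sigma^2 \log n + \delta^2$$
This will result in our desired bound. It therefore remains to prove this, but it reduces to checking that
$$2\sigma^2 \log n + 4\delta\sqrt{\sigma^2 \log n} \ge 0$$
and the left hand side is non-negative. Hence, the desired bound holds.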
Tricks used:
 Jensen’s Inequality
 Markov’s Inequality
 Take a Derivative
 Union Bound
Comments: My earlier blog post (along with this one) shows what I mean when I say “take a derivative.” It happens when there is an upper bound on the right hand side and we have a free parameter $\lambda \in \mathbb{R}$ (or $\lambda \ge 0$) which we can optimize to get the tightest possible bound. Often times, such a $\lambda$ is explicitly introduced via Markov’s Inequality, as we have here. Just make sure to double check that when taking a derivative, you’re getting a minimum, not a maximum. In addition, Markov’s Inequality can only be applied to non-negative random variables, which is why we often have to exponentiate the terms inside a probability statement first.
Note the use of convexity of the exponential function. It is very common to see Jensen’s inequality applied with the exponential function. Always remember that $e^{\mathbb{E}[X]} \le \mathbb{E}[e^X]$!!
The procedure that I refer to as the “union bound” when I bound a maximum by a sum isn’t exactly the canonical way of doing it, since that typically involves probabilities, but it has a similar flavor. More formally, the union bound states that
$$\mathbb{P}\Big[\bigcup_{i} A_i\Big] \le \sum_{i} \mathbb{P}[A_i]$$
for countable sets of events $A_1, A_2, \ldots$. When we define a set of events based on a maximum of certain variables, that’s the same as taking the union of the individual events.
On a final note, be on the lookout for applications of this type whenever a “maximum” operation is seen with something that resembles Gaussians. Sometimes this can be a bit subtle. For instance, it’s not uncommon to use a bound of the form above when dealing with $\mathbb{E}[\|w\|_\infty]$, the expectation of the $L_\infty$ norm of a standard Gaussian vector. In addition, when dealing with sparsity, often our “$n$” or “$d$” is actually something like $\binom{d}{s}$ or another combinatorics-style value. Seeing a “log” accompanied by a square root is a good clue and may help identify such cases.
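As a quick sanity check on the expectation bound for the maximum of absolute values, here is a Monte Carlo sketch. For simplicity I assume i.i.d. Gaussians with standard deviation `sigma` (the bound itself needs no independence) and compare against $2\sqrt{\sigma^2 \log n}$ for $n \ge 2$. The function name and the trial count are my own choices.

```python
import numpy as np

def estimate_and_bound(n, sigma=1.0, trials=2000, rng=None):
    """Monte Carlo estimate of E[max_i |X_i|] for n i.i.d. N(0, sigma^2) draws,
    together with the upper bound 2 * sqrt(sigma^2 * log n)."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = rng.normal(scale=sigma, size=(trials, n))
    estimate = np.abs(X).max(axis=1).mean()
    bound = 2.0 * np.sqrt(sigma**2 * np.log(n))
    return estimate, bound

for n in (10, 100, 1000):
    estimate, bound = estimate_and_bound(n)
    print(n, estimate, bound)
```

The estimate should sit below the bound for each value of `n`, with both growing on the order of $\sqrt{\log n}$.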
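Example 2: Bounded Random Variables are Sub-Gaussian
This example is really split into two parts. The first is as follows:
Prove that Rademacher random variables are sub-Gaussian with parameter $\sigma = 1$.
The next is:
Prove that if $X$ is a zero-mean random variable with support $X \in [a, b]$, then $X$ is sub-Gaussian with parameter (at most) $\sigma = b - a$.
To prove the first part, let $\varepsilon$ be a Rademacher random variable. For $\lambda \in \mathbb{R}$, we have
$$\begin{aligned} \mathbb{E}[e^{\lambda \varepsilon}] &\overset{(i)}{=} \frac{1}{2}\left(e^{-\lambda} + e^{\lambda}\right) \\ &\overset{(ii)}{=} \frac{1}{2}\left(\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} + \sum_{k=0}^\infty \frac{\lambda^k}{k!}\right) \\ &\overset{(iii)}{=} \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \\ &\overset{(iv)}{\le} \sum_{k=0}^\infty \frac{\lambda^{2k}}{2^k k!} \\ &\overset{(v)}{=} e^{\frac{\lambda^2}{2}} \end{aligned}$$
and thus the claim is satisfied by the definition of a sub-Gaussian random variable. In (i), we removed the expectation by using facts from Rademacher random variables, in (ii) we used the series expansions of the exponential function, in (iii) we simplified by removing the odd powers, in (iv) we used the clever trick that $(2k)! \ge 2^k k!$, and in (v) we again used the exponential function’s power series.
To prove the next part, observe that for any $\lambda \in \mathbb{R}$, we have
$$\begin{aligned} \mathbb{E}_X[e^{\lambda X}] &\overset{(i)}{=} \mathbb{E}_X\left[e^{\lambda (X - \mathbb{E}_{X'}[X'])}\right] \\ &\overset{(ii)}{\le} \mathbb{E}_{X,X'}\left[e^{\lambda (X - X')}\right] \\ &\overset{(iii)}{=} \mathbb{E}_{X,X',\varepsilon}\left[e^{\lambda \varepsilon (X - X')}\right] \\ &\overset{(iv)}{\le} \mathbb{E}_{X,X'}\left[e^{\frac{\lambda^2 (X - X')^2}{2}}\right] \\ &\overset{(v)}{\le} e^{\frac{\lambda^2 (b-a)^2}{2}} \end{aligned}$$
which shows by definition that $X$ is sub-Gaussian with parameter $\sigma = b - a$. In (i), we cleverly introduce an extra independent copy $X'$ inside the exponent. It’s zero-mean, so we can insert it there without issues.^{1} In (ii), we use Jensen’s inequality, and note that we can do this with respect to just the random variable $X'$. (If this is confusing, just think of the expression as a function of $X'$ and ignore the outer expectation.) In (iii) we apply a clever symmetrization trick by multiplying a Rademacher random variable to $X - X'$. The reason why we can do this is that $X - X'$ is already symmetric about zero. Hence, inserting the Rademacher factor will maintain that symmetry (since Rademachers are only +1 or -1). In (iv), we applied the Rademacher sub-Gaussian bound with $X - X'$ held fixed, and then in (v), we finally use the fact that $X, X' \in [a, b]$.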
Tricks used:
 Introducing an Independent Copy
 Jensen’s Inequality
 Series Expansions (twice!!)
 Symmetrization
Comments: The first part is a classic exercise in theoretical statistics, one which tests your ability to understand how to use the power series of exponential functions. The first part involved converting an exponential function to a power series, and then later doing the reverse. When I was doing this problem, I found it easiest to start by stating the conclusion (that we would have $e^{\lambda^2/2}$ somehow) and then I worked backwards. Obviously, this only works when the problem gives us the solution!
The next part is also “classic” in the sense that it’s often how students (such as myself) are introduced to the symmetrization trick. The takeaway is that one should be on the lookout for anything that seems symmetric. Or, failing that, perhaps introduce symmetry by adding in an extra independent copy, as we did above. But make sure that your random variables are zero-mean!!
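Both ingredients of the Rademacher argument are easy to spot-check numerically. The sketch below is my own illustration: it checks the factorial inequality $(2k)! \ge 2^k k!$ term by term and the resulting bound $\mathbb{E}[e^{\lambda\varepsilon}] = \cosh(\lambda) \le e^{\lambda^2/2}$ on a grid of $\lambda$ values. The grid and the range of `k` are arbitrary.

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * eps)] for eps uniform on {-1, +1}, which equals cosh(lam)."""
    return 0.5 * (math.exp(-lam) + math.exp(lam))

# Series step: (2k)! >= 2^k * k! for each k (exact integer arithmetic).
factorial_ok = all(
    math.factorial(2 * k) >= 2**k * math.factorial(k) for k in range(30)
)

# Conclusion: cosh(lam) <= exp(lam^2 / 2) on a grid of lam values in [-5, 5].
grid = [i / 10.0 for i in range(-50, 51)]
mgf_ok = all(rademacher_mgf(lam) <= math.exp(lam**2 / 2.0) for lam in grid)

print(factorial_ok, mgf_ok)
```

A grid check is of course not a proof, but it is a cheap way to catch a sign error or a misplaced factor of two before working backwards from the conclusion.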
Example 3: Concentration Around Median and Means
Here’s the question:
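Given a scalar random variable $X$, suppose that there are positive constants $c_1, c_2$ such that
$$\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le c_1 e^{-c_2 t^2}$$
for all $t \ge 0$.
(a) Prove that $\mathrm{Var}(X) \le \frac{c_1}{c_2}$.
(b) Prove that for any median $m_X$, we have
$$\mathbb{P}[|X - m_X| \ge t] \le c_3 e^{-c_4 t^2}$$
for all $t \ge 0$, where $c_3 = 4c_1$ and $c_4 = \frac{c_2}{8}$.
To prove the first part, note that
$$\begin{aligned} \mathrm{Var}(X) &\overset{(i)}{=} \mathbb{E}\left[|X - \mathbb{E}[X]|^2\right] \\ &\overset{(ii)}{=} 2\int_0^\infty t \, \mathbb{P}[|X - \mathbb{E}[X]| \ge t] \, dt \\ &\overset{(iii)}{\le} \frac{c_2}{c_2} \int_0^\infty 2 t \, c_1 e^{-c_2 t^2} \, dt \\ &\overset{(iv)}{=} \frac{c_1}{c_2} \end{aligned}$$
where (i) follows from definition, (ii) follows from the “integrating probabilities into expectations” trick (which I will describe shortly), (iii) follows from the provided bound, and (iv) follows from standard calculus (note the multiplication of $c_2/c_2$ for mathematical convenience). This proves the first claim.
This second part requires some clever insights to get this to work. One way to start is by noting that:
$$\frac{1}{2} \le \mathbb{P}[X \ge m_X] = \mathbb{P}\big[X - \mathbb{E}[X] \ge m_X - \mathbb{E}[X]\big] \le c_1 e^{-c_2 (m_X - \mathbb{E}[X])^2}$$
and where the last inequality follows from the bound provided in the question. For us to be able to apply that bound, assume without loss of generality that $m_X \ge \mathbb{E}[X]$, meaning that our $m_X - \mathbb{E}[X]$ term is non-negative and that we can increase the probability by inserting in absolute values. The above also shows that
$$m_X - \mathbb{E}[X] \le \sqrt{\frac{\log(2c_1)}{c_2}}$$
We next tackle the core of the question. Starting from the left hand side of the desired bound, we get
$$\begin{aligned} \mathbb{P}[|X - m_X| \ge t] &\overset{(i)}{=} \mathbb{P}\big[|X - \mathbb{E}[X] + \mathbb{E}[X] - m_X| \ge t\big] \\ &\overset{(ii)}{\le} \mathbb{P}\big[|X - \mathbb{E}[X]| \ge t - |\mathbb{E}[X] - m_X|\big] \\ &\overset{(iii)}{\le} c_1 e^{-c_2 (t - |\mathbb{E}[X] - m_X|)^2} \end{aligned}$$
where step (i) follows from adding zero, step (ii) follows from the Triangle Inequality, and (iii) follows from the provided bound based on the expectation. And yes, this is supposed to work only for when $t - |\mathbb{E}[X] - m_X| \ge 0$. The way to get around this is that we need to assume $t$ is greater than some quantity. After some algebra, it turns out a nice condition for us to enforce is that $t > \sqrt{\frac{8\log(4c_1)}{c_2}}$, which in turn will make $c_3 e^{-c_4 t^2} < 1$. If $t \le \sqrt{\frac{8\log(4c_1)}{c_2}}$, then the desired bound is attained because
$$\mathbb{P}[|X - m_X| \ge t] \le 1 \le 4c_1 e^{-\frac{c_2}{8} t^2}$$
a fact which can be derived through some algebra. Thus, the remainder of the proof boils down to checking the case that when $t > \sqrt{\frac{8\log(4c_1)}{c_2}}$, we have
$$c_1 e^{-c_2 (t - |\mathbb{E}[X] - m_X|)^2} \le 4c_1 e^{-\frac{c_2}{8} t^2}$$
and this is proved by analyzing roots of the quadratic and solving for $t$.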
Tricks used:
 Integrating Probabilities into Expectations
 Triangle Inequality
Comments: The trick “integrating probabilities into expectations” is one which I only recently learned about, though one can easily find it (along with the derivation) on the Wikipedia page for the expected value. In particular, note that for a positive real number $\alpha$, we have
$$\mathbb{E}[|X|^\alpha] = \alpha \int_0^\infty t^{\alpha - 1} \, \mathbb{P}[|X| \ge t] \, dt$$
and in the above, I use this trick with $\alpha = 2$. It’s quite useful to convert between probabilities and expectations!
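For completeness, here is a short derivation sketch of that identity, written with $\alpha > 0$ as the exponent (my notation). It uses only the fundamental theorem of calculus and a swap of expectation and integral, which is justified by Tonelli’s theorem since the integrand is non-negative:

```latex
\begin{aligned}
\mathbb{E}\big[|X|^\alpha\big]
  &= \mathbb{E}\left[\int_0^{|X|} \alpha t^{\alpha-1}\, dt\right]
   = \mathbb{E}\left[\int_0^\infty \alpha t^{\alpha-1}\, \mathbf{1}\{|X| \ge t\}\, dt\right] \\
  &= \alpha \int_0^\infty t^{\alpha-1}\, \mathbb{E}\big[\mathbf{1}\{|X| \ge t\}\big]\, dt
   = \alpha \int_0^\infty t^{\alpha-1}\, \mathbb{P}\big[|X| \ge t\big]\, dt .
\end{aligned}
```

As a sanity check, take $X$ exponentially distributed with rate one and $\alpha = 2$: the right hand side is $2\int_0^\infty t e^{-t}\,dt = 2$, which matches $\mathbb{E}[X^2] = 2$.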
The other trick above is using the triangle inequality in a clever way. The key is to observe that when we have something like , if we increase the value of , then we increase that probability. This is another common trick used in proving various bounds.
Finally, the above also shows that when we are free to pick constants, it pays to be clever in how we assign those values. The remainder is then some brute-force computation. I suppose it also helps to think about inserting absolute values whenever we have a probability and a median.
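Example 4: Upper Bounds for $\ell_0$ “Balls”
Consider the set
$$T^d(s) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le s, \; \|\theta\|_2 \le 1\}.$$
We often write the number of nonzeros in $\theta$ as $\|\theta\|_0$ like this even though $\|\cdot\|_0$ is not technically a norm. This exercise consists of three parts:
(a) Show that $\mathcal{G}(T^d(s)) = \mathbb{E}\left[\max_{S \in \mathcal{S}} \|w_S\|_2\right]$ where $\mathcal{S}$ consists of all subsets $S$ of $\{1, \ldots, d\}$ of size $s$, and $w_S$ is a subvector of $w$ (of size $s$) indexed by those components. Note that by this definition, the cardinality of $\mathcal{S}$ is equal to $\binom{d}{s}$.
(b) Show that for any fixed subset $S$ of cardinality $s$, we have $\mathbb{P}[\|w_S\|_2 \ge \sqrt{s} + \delta] \le e^{-\delta^2/2}$.
(c) Establish the claim that $\mathcal{G}(T^d(s)) \precsim \sqrt{s \log(ed/s)}$.
To be clear on the notation, $\mathcal{G}(T^d(s)) = \mathbb{E}\left[\sup_{\theta \in T^d(s)} \langle \theta, w \rangle\right]$ with $w \sim N(0, I_d)$, and refers to the Gaussian complexity of that set. It is, roughly speaking, a way to measure the “size” of a set.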
To prove (a), let $\theta \in T^d(s)$ and let $S$ indicate the support of $\theta$ (i.e. where its nonzeros occur). For any $w \in \mathbb{R}^d$ (which we later treat to be sampled from $N(0, I_d)$, though the immediate analysis below does not require that fact) we have
$$\langle \theta, w \rangle = \langle \theta_S, w_S \rangle \le \|\theta_S\|_2 \|w_S\|_2 \le \|w_S\|_2$$
where $w_S$ refers to the vector taking only the components of $w$ indexed by $S$. The first inequality follows from Cauchy-Schwarz. In addition, by standard norm properties, taking $\theta_S = w_S / \|w_S\|_2$ results in the case when equality is attained. The claim thus follows. (There are some technical details needed regarding which of the maximums, over the set sizes or over the vector selection, should come first, but I don’t think the details are critical for me to know.)
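For (b), we first claim that the function $f_S : \mathbb{R}^d \to \mathbb{R}$ defined as $f_S(w) = \|w_S\|_2$ is Lipschitz with respect to the Euclidean norm with Lipschitz constant $L = 1$. To see this, observe that when $w$ and $w'$ are both $d$-dimensional vectors, we have
$$|f_S(w) - f_S(w')| = \big| \|w_S\|_2 - \|w'_S\|_2 \big| \overset{(i)}{\le} \|w_S - w'_S\|_2 \overset{(ii)}{\le} \|w - w'\|_2$$
where (i) follows from the reverse triangle inequality for normed spaces and (ii) follows from how the vector $w_S - w'_S$ cannot have more nonzero terms than $w - w'$ but must otherwise match it for indices lying in the subset $S$.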
The fact that $f_S$ is Lipschitz means that we can apply a theorem regarding tail bounds of Lipschitz functions of Gaussian variables. The function $f_S$ here doesn’t require its input to consist of vectors with IID standard Gaussian components, but we have to assume that the input is like that for the purposes of the theorem/bound to follow. More formally, for all $\delta \ge 0$ we have
$$\mathbb{P}[\|w_S\|_2 \ge \sqrt{s} + \delta] \overset{(i)}{\le} \mathbb{P}[\|w_S\|_2 \ge \mathbb{E}[\|w_S\|_2] + \delta] \overset{(ii)}{\le} e^{-\delta^2/2}$$
where (i) follows from how $\mathbb{E}[\|w_S\|_2] \le \sqrt{s}$ and thus we are just decreasing the threshold for the event (hence making it more likely) and (ii) follows from the theorem, which provides an $L^2$ in the denominator of the exponential, but $L = 1$ here.
Finally, to prove (c), we first note that the previous part’s theorem guaranteed that the function $f_S(w) = \|w_S\|_2$ is sub-Gaussian with parameter $\sigma = 1$. Using this, we have
$$\mathcal{G}(T^d(s)) = \mathbb{E}\left[\max_{S \in \mathcal{S}} \|w_S\|_2\right] \overset{(i)}{\le} \sqrt{2 \sigma^2 \log \binom{d}{s}} \overset{(ii)}{\precsim} \sqrt{s \log(ed/s)}$$
where (i) applies the bound for a maximum over sub-Gaussian random variables $\|w_S\|_2$ for all the $\binom{d}{s}$ sets $S$ (see Example 1 earlier), each with parameter $\sigma$, and (ii) applies an approximate bound due to Stirling’s approximation and ignores the constants of $\sqrt{2}$ and $\sigma$. The careful reader will note that Example 1 required zero-mean sub-Gaussian random variables, but we can generally get around this by, I believe, subtracting away a mean and then re-adding it later.
Tricks used:
 Cauchy-Schwarz
 Jensen’s Inequality
 Lipschitz Functions
 Norm Properties
 Stirling’s Approximation
 Triangle Inequality
Comments: This exercise involves a number of tricks. The fact that $\mathbb{E}[\|w_S\|_2] \le \sqrt{s}$ follows from how
$$\mathbb{E}[\|w_S\|_2] = \mathbb{E}\left[\sqrt{\|w_S\|_2^2}\right] \le \sqrt{\mathbb{E}[\|w_S\|_2^2]} = \sqrt{s}$$
due to Jensen’s inequality and how $\mathbb{E}[X^2] = 1$ for $X \sim N(0,1)$. Fiddling with norms, expectations, and square roots is another common way to utilize Jensen’s inequality (in addition to using Jensen’s inequality with the exponential function, as explained earlier). Moreover, if you see norms in a probabilistic bound statement, you should immediately be thinking of the possibility of using a theorem related to Lipschitz functions.
The example also uses the (reverse!) triangle inequality for norms:
$$\big| \|x\|_2 - \|y\|_2 \big| \le \|x - y\|_2$$
This can come up quite often and is the non-canonical way of viewing the triangle inequality, so watch out!
Finally, don’t forget the trick where we have $\log \binom{d}{s} \le s \log(ed/s)$. This comes from an application of Stirling’s approximation and is seen frequently in cases involving sparsity, where $s$ components are “selected” out of $d$ total. The maximum over a finite set should also provide a big hint regarding the use of a sub-Gaussian bound over maximums of (sub-Gaussian) variables.
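To convince myself that the sparsity bound $\log \binom{d}{s} \le s \log(ed/s)$ is not hiding anything, here is a minimal numerical sketch. The function names and the particular values of $d$ are arbitrary choices of mine; the inequality itself follows from $\binom{d}{s} \le d^s / s!$ together with $s! \ge (s/e)^s$.

```python
import math


def log_binomial(d, s):
    """log of the number of size-s subsets of {1, ..., d}."""
    return math.log(math.comb(d, s))


def sparsity_bound(d, s):
    """The upper bound s * log(e * d / s)."""
    return s * math.log(math.e * d / s)


# Slack of the bound over a grid of (d, s) pairs; every entry should be positive.
gaps = []
for d in (10, 100, 1000):
    for s in range(1, d + 1):
        gaps.append(sparsity_bound(d, s) - log_binomial(d, s))

smallest_gap = min(gaps)
print(smallest_gap)
```

The slack is smallest for $s = 1$, where the bound reads $\log d \le 1 + \log d$.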
Example 5: Gaussian Complexity of Ellipsoids
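Recall that the space $\ell_2(\mathbb{N})$ consists of all real sequences $(\theta_j)_{j=1}^\infty$ such that $\sum_{j=1}^\infty \theta_j^2 < \infty$. Given a strictly positive sequence $(\mu_j)_{j=1}^\infty \in \ell_2(\mathbb{N})$, consider the associated ellipse
$$\mathcal{E} = \left\{ (\theta_j)_{j=1}^\infty \;\Big|\; \sum_{j=1}^\infty \theta_j^2 / \mu_j^2 \le 1 \right\}$$
(a) Prove that the Gaussian complexity satisfies the bounds
$$\sqrt{\frac{2}{\pi}} \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2} \le \mathcal{G}(\mathcal{E}) \le \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2}$$
(b) For a given radius $r > 0$, consider the truncated set
$$\tilde{\mathcal{E}}(r) = \mathcal{E} \cap \left\{ (\theta_j)_{j=1}^\infty \;\Big|\; \sum_{j=1}^\infty \theta_j^2 \le r^2 \right\}$$
Obtain upper and lower bounds on its Gaussian complexity $\mathcal{G}(\tilde{\mathcal{E}}(r))$ that are tight up to universal constants independent of $r$ and $(\mu_j)_{j=1}^\infty$.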
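To prove (a), we first start with the upper bound. Letting $w = (w_j)_{j=1}^\infty$ indicate a sequence of IID standard Gaussians $w_j \sim N(0,1)$, we have
$$\mathcal{G}(\mathcal{E}) \overset{(i)}{=} \mathbb{E}\left[ \sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty w_j \theta_j \right] \overset{(ii)}{=} \mathbb{E}\left[ \sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \frac{\theta_j}{\mu_j} w_j \mu_j \right] \overset{(iii)}{\le} \mathbb{E}\left[ \sup_{\theta \in \mathcal{E}} \left( \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \right)^{1/2} \left( \sum_{j=1}^\infty w_j^2 \mu_j^2 \right)^{1/2} \right] \overset{(iv)}{\le} \mathbb{E}\left[ \left( \sum_{j=1}^\infty w_j^2 \mu_j^2 \right)^{1/2} \right] \overset{(v)}{\le} \left( \mathbb{E}\left[ \sum_{j=1}^\infty w_j^2 \mu_j^2 \right] \right)^{1/2} \overset{(vi)}{=} \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2}$$
where (i) follows from definition, (ii) follows from multiplying by one, (iii) follows from a clever application of the Cauchy-Schwarz inequality for sequences (or more generally, Hölder’s Inequality), (iv) follows from the definition of $\mathcal{E}$, (v) follows from Jensen’s inequality, and (vi) follows from linearity of expectation and how $\mathbb{E}[w_j^2] = 1$.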
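We next prove the lower bound. First, we note a well-known result that $\mathcal{G}(\mathcal{E}) \ge \sqrt{2/\pi} \, \mathcal{R}(\mathcal{E})$ where $\mathcal{R}(\mathcal{E})$ indicates the Rademacher complexity of the set. Thus, our task now boils down to showing that $\mathcal{R}(\mathcal{E}) = \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2}$. Letting $\varepsilon_j$ be IID Rademachers, we first begin by proving the upper bound
$$\mathcal{R}(\mathcal{E}) \overset{(i)}{=} \mathbb{E}\left[ \sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \varepsilon_j \theta_j \right] \overset{(ii)}{=} \sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \frac{\theta_j}{\mu_j} \mu_j \overset{(iii)}{\le} \sup_{\theta \in \mathcal{E}} \left( \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \right)^{1/2} \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2} \overset{(iv)}{\le} \left( \sum_{j=1}^\infty \mu_j^2 \right)^{1/2}$$
where (i) follows from definition, (ii) follows from the symmetric nature of the class $\mathcal{E}$ (meaning that WLOG we can pick $\varepsilon_j = 1$ for all $j$) and then multiplying by one, (iii) follows from Cauchy-Schwarz again, and (iv) follows from the provided bound in the definition of $\mathcal{E}$.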
We’re not done yet: we actually need to show equality for this, or at the very least prove a lower bound instead of an upper bound. However, if one chooses the valid sequence such that , then equality is attained since we get
in one of our steps above. This proves part (a).
For part (b), we construct two ellipses, one that contains and one which is contained inside it. Let . Then we claim that the ellipse defined out of this sequence (i.e. treating “” as our “”) will be contained in . We moreover claim that the ellipse defined out of the sequence for all contains , i.e. . If this is true, it then follows that
because the definition of Gaussian complexity requires taking a maximum of over a set, and if the set grows larger via set containment, then the Gaussian complexity can only grow larger. In addition, the fact that the upper and lower bounds are related by a constant suggests that there should be extra lower and upper bounds utilizing universal constants independent of and .
Let us prove the two set inclusions previously described, as well as develop the desired upper and lower bounds. Suppose . Then we have
and
In both cases, the first inequality is because we can only decrease the value in the denominator.^{2} The last inequality follows by assumption of membership in . Both requirements for membership in are satisfied, and therefore, implies and thus the first set containment. Moving on to the second set containment, suppose . We have
where (i) follows from a “union bound”-style argument, which to be clear, happens because for every term in the summation, we have either or added to the summation (both positive quantities). Thus, to make the value larger, just add both terms! Step (ii) follows from the assumption of membership in . Thus, we conclude that , and we have proved that
The final step of this exercise is to develop a lower bound on the left hand side and an upper bound on the right hand side that are close up to universal constants. But we have reduced this to an instance of part (a)! Thus, we simply apply the lower bound for and the upper bound for and obtain
as our final bounds on . (Note that as a sanity check, the constant offset is less than one.) This proves part (b).
Tricks used:
 Cauchy-Schwarz
 Jensen’s Inequality
 Union Bound
Comments: This exercise on the surface looks extremely challenging. How does one reason about multiple infinite sequences, which furthermore may or may not involve squared terms? I believe the key to tackling these problems is to understand how to apply Cauchy-Schwarz (or more generally, Hölder’s Inequality) for infinite sequences. More precisely, Hölder’s Inequality for sequence spaces states that
$$\sum_{k=1}^\infty |x_k y_k| \le \left( \sum_{k=1}^\infty x_k^2 \right)^{1/2} \left( \sum_{k=1}^\infty y_k^2 \right)^{1/2}$$
(It’s actually more general than this, since we can assume arbitrary positive powers $p$ and $q$ so long as $1/p + 1/q = 1$, but the easiest case to understand is when $p = q = 2$.)
Hölder’s Inequality is enormously helpful when dealing with sums (whether infinite or not), and especially when dealing with two sums if one does not square its terms, but the other one does.
Finally, again, think about Jensen’s inequality whenever we have expectations and a square root!
Example 6: Pairwise Incoherence
Given a matrix , suppose it has normalized columns ( for all ) and pairwise incoherence upper bounded as .
(a) Let be any subset of size . Show that there is a function such that as long as is sufficiently small, where is the matrix formed by extracting the columns of whose indices are in .
(b) Prove, from first principles, that satisfies the restricted nullspace property with respect to as long as .
To clarify, the pairwise incoherence of a matrix is defined as
where denotes the th column of . Intuitively, it measures the correlation between any columns, though it subtracts an indicator at the end so that the maximal case does not always correspond to the case when . In addition, the matrix as defined in the problem looks like:
where the 1s in the diagonal are due to the assumption of having normalized columns.
First, we prove part (a). Starting from the variational representation of the minimum eigenvalue, we consider any possible with Euclidean norm one (and thus this analysis will apply for the minimizer which induces the minimum eigenvalue) and observe that
where (i) follows from the definition of a quadratic form (less formally, by matrix multiplication), (ii) follows from the assumption, (iii) follows from noting that
which in turn follows from the pairwise incoherence assumption that . Step (iv) follows from definition, and (v) follows from how for dimensional vectors.
The above applies for any satisfactory . Putting together the pieces, we conclude that
which follows if is sufficiently small.
To prove the restricted nullspace property in (b), we first suppose that and . Define dimensional vectors and which match components of for the indices within their respective sets or , and which are zero otherwise.^{3} Supposing that corresponds to the subset of indices of of the largest elements in absolute value, it suffices to show that , because then we can never violate this inequality (and thus the restricted nullspace property holds).
We first show a few facts which we then piece together to get the final result. The first is that
where (i) follows from the assumption that is in the kernel of , (ii) follows from how , (iii) follows from expanding the term, and (iv) follows from carefully noting that
where in the inequality, we have simply chosen as our , which can only make the bound worse. Then step (iv) follows immediately. Don’t forget that , because the latter involves a vector that (while longer) only has extra zeros. Incidentally, the above uses the variational representation for eigenvalues in a way that’s more convenient if we don’t want to restrict our vectors to have Euclidean norm one.
We conclude from the above that
Next, let us upper bound the RHS. We see that
where (i) follows from a little thought about how matrix multiplication and quadratic forms work. In particular, if we expanded out the LHS, we would get a sum with lots of terms that are zero since or would cancel them out. (To be clear, and .) Step (ii) follows from definition, step (iii) follows from the provided Pairwise Incoherence bound (note the need to multiply by ), and step (iv) follows from how
and thus it is clear that the product of the norms consists of the sum of all possible combination of indices with nonzero values.
The last thing we note is that from part (a), if we assumed that , then a lower bound on is . Putting the pieces together, we get the following three inequalities
We can provide a lower bound for the first term above. Using the fact that , we get . The final step is to tie the lower bound here with the upper bound from the set of three inequalities above. This results in
Under the same assumption earlier (that ) it follows directly that , as claimed. Whew!
Tricks used:
 Cauchy-Schwarz
 Norm Properties
 Variational Representation (of eigenvalues)
Comments: Actually, for part (a), one can prove this more directly by using the Gershgorin Circle Theorem, a very useful Theorem with a surprisingly simple proof. But I chose this way above so that we can make use of the variational representation for eigenvalues. There are also variational representations for singular values.
The above uses a lot of norm properties. One example was the use of $\|\theta\|_1 \le \sqrt{s} \|\theta\|_2$ for $s$-dimensional vectors $\theta$, which can be proved via Cauchy-Schwarz. The extension to this is that . These are quite handy. Another example, which is useful when dealing with specific subsets, is to understand how the and norms behave. Admittedly, getting all the steps right for part (b) takes a lot of hassle and attention to details, but it is certainly satisfying to see it work.
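For readers who like to see numbers, here is a minimal NumPy sketch of the Gershgorin route mentioned above. It assumes the normalization $\|X_j\|_2 / \sqrt{n} = 1$ and the definition of pairwise incoherence as the largest entry (in absolute value) of $X^T X / n - I$; the random Gaussian design, the sizes, and all function names are hypothetical choices of mine. Gershgorin then gives a minimum eigenvalue of at least $1 - (s-1)\delta$ for any column subset of size $s$.

```python
import numpy as np


def normalize_columns(X):
    """Rescale every column so that its Euclidean norm equals sqrt(n)."""
    n = X.shape[0]
    return X * (np.sqrt(n) / np.linalg.norm(X, axis=0))


def pairwise_incoherence(X):
    """Largest absolute entry of X^T X / n - I."""
    n, d = X.shape
    gram = X.T @ X / n
    return np.max(np.abs(gram - np.eye(d)))


def min_eigenvalue_on_subset(X, S):
    """Minimum eigenvalue of X_S^T X_S / n for the column subset S."""
    n = X.shape[0]
    X_S = X[:, S]
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(X_S.T @ X_S / n)[0]


rng = np.random.default_rng(0)
n, d, s = 200, 50, 5
X = normalize_columns(rng.standard_normal((n, d)))
delta = pairwise_incoherence(X)
S = list(range(s))  # any subset of size s would do
gamma_min = min_eigenvalue_on_subset(X, S)
gershgorin_bound = 1.0 - (s - 1) * delta
print(delta, gamma_min, gershgorin_bound)
```

The bound is only informative when $\delta$ is small relative to $1/s$, which matches the “sufficiently small” condition in part (a).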
Closing Thoughts
I hope this post serves as a useful reference for me and for anyone else who might need to use one of these tricks to understand some machine learning and statistics-related math.

One of my undergraduate mathematics professors, Steven J. Miller, would love this trick, as his two favorite tricks in mathematics are adding zero (along with, of course, multiplying by one). ↩

Or “downstairs” as professor Michael I. Jordan often puts it (and obviously, “upstairs” for the numerator). ↩

It can take some time and effort to visualize and process all this information. I find it helpful to draw some of these out with pencil and paper, and also to assume without loss of generality that corresponds to the first “block” of , and therefore corresponds to the second (and last) “block.” Please contact me if you spot typos; they’re really easy to make here. ↩
Following Professor Michael I. Jordan's Advice: "Your Brain Needs Exercise"
The lone class I am taking this semester is STAT 210B, the second course in the PhD-level theoretical statistics sequence. I took STAT 210A last semester, and I briefly wrote about the class here. I’ll have more to say about STAT 210B in late May, but in this post I’d first like to present an interesting problem that our professor, Michael I. Jordan, brought up in lecture a few weeks ago.
The problem Professor Jordan discussed was actually an old homework question, but he said that it was so important for us to know this that he was going to prove it in lecture anyway, without using any notes whatsoever. He also stated:
“Your brain needs exercise.”
He then went ahead and successfully proved it, and urged us to do the same thing.
OK, if he says to do that, then I will follow his advice and write out my answer in this blog post. I’m probably the only student in class who’s going to be doing this, but I’m already a bit unusual in having a long-running blog. If any of my classmates are reading this and have their own blogs, let me know!
By the way, for all the students out there who say that they don’t have time to maintain personal blogs, why not take baby steps and start writing about stuff that accomplishes your educational objectives, such as doing practice exercises? It’s a nice way to make yourself look more productive than you actually are, since you would be doing those anyway.
Anyway, here at last is the question Professor Jordan talked about:
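Let $\{X_i\}_{i=1}^n$ be a sequence of zero-mean random variables, each sub-Gaussian with parameter $\sigma$ (no independence assumptions are needed). Prove that
$$\mathbb{E}\left[ \max_{i=1,\ldots,n} X_i \right] \le \sqrt{2 \sigma^2 \log n}$$
for all $n \ge 1$.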
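This problem is certainly on the easier side of the homework questions we’ve had, but it’s a good baseline and I’d like to showcase the solution here. Like Professor Jordan, I will do this problem (a.k.a. write this blog post) without any form of notes. Here goes: for $\lambda > 0$, we have
$$e^{\lambda \mathbb{E}[\max_i X_i]} \overset{(i)}{\le} \mathbb{E}\left[ e^{\lambda \max_i X_i} \right] \overset{(ii)}{=} \mathbb{E}\left[ \max_i e^{\lambda X_i} \right] \overset{(iii)}{\le} \sum_{i=1}^n \mathbb{E}\left[ e^{\lambda X_i} \right] \overset{(iv)}{\le} n e^{\lambda^2 \sigma^2 / 2}$$
where: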
 Step (i) follows from Jensen’s inequality. Yeah, that inequality is everywhere.
 Step (ii) follows from noting that one can pull the maximum outside of the exponential.
 Step (iii) follows from the classic union bound, which can be pretty bad but we don’t have much else to go on here. The key fact is that the exponential makes all terms in the sum positive.
 Step (iv) follows from applying the sub-Gaussian bound to all variables, and then summing them together.
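Next, taking logs and rearranging, we have
$$\mathbb{E}\left[ \max_i X_i \right] \le \frac{\log n}{\lambda} + \frac{\lambda \sigma^2}{2}$$
Since $\lambda$ is isolated on the right hand side, we can differentiate with respect to it to find the tightest upper bound. Doing so, we get $\lambda^* = \sqrt{2 \log n} / \sigma$. Plugging this back in, we get
$$\mathbb{E}\left[ \max_i X_i \right] \le \frac{\sigma \log n}{\sqrt{2 \log n}} + \frac{\sigma \sqrt{2 \log n}}{2} = \sqrt{2 \sigma^2 \log n}$$
which proves the desired claim.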
I have to reiterate that this problem is easier than the others we’ve done in STAT 210B, and I’m sure that over 90 percent of the students in the class could do this just as easily as I could. But this problem makes clear the techniques that are often used in theoretical statistics nowadays, so at minimum students should have a firm grasp of the content in this blog post.
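A minimal Monte Carlo sketch can make the bound feel less abstract. It compares an empirical estimate of $\mathbb{E}[\max_i X_i]$ against $\sqrt{2 \sigma^2 \log n}$ for Gaussian samples, which are sub-Gaussian with parameter $\sigma$. The samples here are independent purely for convenience (the claim does not need independence), and the sample sizes, seed, and function names are arbitrary assumptions of mine.

```python
import numpy as np


def empirical_expected_max(n, sigma, num_trials, rng):
    """Monte Carlo estimate of E[max_i X_i] for n IID N(0, sigma^2) variables."""
    samples = sigma * rng.standard_normal((num_trials, n))
    return samples.max(axis=1).mean()


def subgaussian_max_bound(n, sigma):
    """The bound sqrt(2 * sigma^2 * log n) from the exercise."""
    return np.sqrt(2.0 * sigma**2 * np.log(n))


rng = np.random.default_rng(0)
sigma = 1.5
results = {}
for n in (2, 10, 100, 1000):
    estimate = empirical_expected_max(n, sigma, 5000, rng)
    results[n] = (estimate, subgaussian_max_bound(n, sigma))
    print(n, results[n])
```

In every row the estimate should sit below the bound, with a gap that reflects the slack in the union-bound step.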
Update April 23, 2017: In an earlier version of this post, I made an error with taking a maximum outside of an expectation. I have fixed this post. Thanks to Billy Fang for letting me know about this.
What I Wish People Would Say About Diversity
The two mainstream newspapers that I read the most, The New York Times and The Wall Street Journal, both have recent articles about diversity and the tech industry, a topic which by now has considerable and well-deserved attention.
The New York Times article starts out with:
Like other Silicon Valley giants, Facebook has faced criticism over whether its work force and board are too white and too male. Last year, the social media behemoth started a new push on diversity in hiring and retention.
Now, it is extending its efforts into another corner: the outside lawyers who represent the company in legal matters.
Facebook is requiring that women and ethnic minorities account for at least 33 percent of law firm teams working on its matters.
The Wall Street Journal article says:
The tech industry has been under fire for years over the large percentage of white and Asian male employees and executives. Tech firms have started initiatives to try to combat the trend, but few have shown much progress.
The industry is now under scrutiny from the Labor Department for the issue. The department sued software giant Oracle Corp. earlier this year for allegedly paying white male workers more than other employees. Oracle said at the time of the suit that the complaint was politically motivated, based on false allegations, and without merit.
These articles discuss important issues that need to be addressed in the tech industry. However, I would also like to gently bring up some other points that I think should be considered in tandem.

The first is to clearly identify Asians (and multiracials^{1}) as either belonging to a minority group or not. To its credit, the Wall Street Journal article states this when including Asians among the “large percentage of employees”, but I often see this fact elided in favor of just “white males.” This is a broader issue which also arises when debating about affirmative action. Out of curiosity, I opened up the Supreme Court’s opinions on Fisher v. University of Texas at Austin (PDF link) and did a search for the word “Asians”, which appears 66 times. Only four of those instances appear in the majority opinion written by Justice Kennedy supporting race-conscious admission; the other 62 occurrences of “Asians” are in Justice Alito’s dissent.

The second is to suggest that there are people who have good reason to believe that they would substantially contribute to workplace diversity, or who have had to overcome considerable life challenges (which I argue also increases work diversity), but who might otherwise not be considered a minority. For instance, suppose a recent refugee from Syria with some computer programming background applied to work at Google. If I were managing a hiring committee and I knew of the applicant’s background information, I would be inspired and would hold him to a slightly lower standard than other applicants, even if he happened to be white and male. There are other possibilities, and one could argue that poor whites or people who are disabled should qualify.

The third is to identify that there is a related problem in the tech industry about the pool of qualified employees to begin with. If the qualified applicants to tech jobs follow a certain distribution of the overall population, then the most likely outcome is that the people who get hired mirror that distribution. Thus, I would encourage emphasis on rephrasing the argument as follows: “tech companies have been under scrutiny for having a workforce which consists of too many white and Asian males with respect to the population distribution of qualified applicants” (emphasis mine). The words “qualified applicants” might be loaded, though. Tech companies often filter students based on school because that is an easy and accurate way to identify the top students, and in some schools (such as the one I attend, for instance), the proportion of underrepresented minorities as traditionally defined has remained stagnant for decades.
I don’t want to sound insensitive to the need to make the tech workforce more diverse. Indeed, that’s the opposite of what I feel, and I think (though I can’t say for sure) that I would be more sensitive to the needs of underrepresented minorities given my frequent experience of feeling like an outcast among my classmates and colleagues.^{2} I just hope that my alternative perspective is compatible with increasing diversity and can work alongside — rather than against — the prevailing view.

See my earlier blog post about this. ↩

I also take offense at the stereotype of the computer scientist as a “shy, nerdy, antisocial male” and hope that it gets eradicated. I invite the people espousing this stereotype to live in my shoes for a day. ↩
Sir Tim Berners-Lee Wins the Turing Award
The news is out that Sir Tim Berners-Lee has won the 2016 Turing Award, the highest honor in computer science. (Turing Award winners are usually announced a few months after the actual year of the award.) He is best known for inventing the World Wide Web, as clearly highlighted by the ACM’s citation:
For inventing the World Wide Web, the first web browser, and the fundamental protocols and algorithms allowing the Web to scale.
(You can also find more information about some of his work on his personal website, where he has some helpful FAQs.)
My first reaction to reading the news was: he didn’t already have a Turing Award?!? I actually thought he had been a co-winner with Vinton Cerf and Robert Kahn, but nope. At least he’s won it now, so we won’t be seeing Quora posts like this one anymore.
I’m rather surprised that this announcement wasn’t covered by many mainstream newspapers. I tried searching for something in the New York Times, but nothing showed up. This is rather a shame, because if we think of inventing the World Wide Web as the “bar” for the Turing Award, then that’s a pretty high bar.
My prediction for the winner was actually Geoffrey Hinton, but I can’t argue with Sir Tim Berners-Lee. (Thus, Hinton is going to be my prediction for the 2017 award.) Just like Terence Tao for the Fields Medal, Steven Weinberg for the Nobel Prize in Physics, Merrick Garland for the Supreme Court, and so on, they’re so utterly qualified that I can’t think of a reason to oppose them.
Notes on the Generalized Advantage Estimation Paper
This post serves as a continuation of my last post on the fundamentals of policy gradients. Here, I continue it by discussing the Generalized Advantage Estimation (arXiv link) paper from ICLR 2016, which presents and analyzes more sophisticated forms of policy gradient methods.
Recall that raw policy gradients, while unbiased, have high variance. This paper proposes ways to dramatically reduce variance, but this unfortunately comes at the cost of introducing bias, so one needs to be careful before applying tricks like this in practice.
The setting is the usual one which I presented in my last post, and we are indeed trying to maximize the sum of rewards (assume no discount). I’m happy that the paper includes a concise set of notes summarizing policy gradients:
If the above is not 100% clear to you, I recommend reviewing the basics of policy gradients. I covered five of the six forms of the $\Psi_t$ function in my last post; the exception is the temporal difference residual, which I will go over later here.
Somewhat annoyingly, they use the infinite-horizon setting. I find it easier to think about the finite-horizon case, and I will clarify if I’m assuming that.
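Proposition 1: $\gamma$-Just Estimators.
One of the first things they prove is Proposition 1, regarding “$\gamma$-just” advantage estimators. (The word “just” seems like an odd choice here, but I’m not complaining.) Suppose $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$ is an estimate of the advantage function. A $\gamma$-just estimator (of the advantage function) results in
$$\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[ \hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t) \right] = \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[ A^{\pi,\gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t | s_t) \right]$$
This is for one time step $t$. If we sum over all time steps, by linearity of expectation we get
$$\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[ \sum_{t=0}^\infty \hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t) \right] = \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[ \sum_{t=0}^\infty A^{\pi,\gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t | s_t) \right]$$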
In other words, we get an unbiased estimate of the discounted gradient. Note, however, that this discounted gradient is different from the gradient of the actual function we’re trying to optimize, since that was for the undiscounted rewards. The authors emphasize this in a footnote, saying that they’ve already introduced bias by even assuming the use of a discount factor. (I’m somewhat pleased at myself for catching this in advance.)
The proof for Proposition 1 is based on proving it for one time step $t$, which is all that is needed. The resulting term with $\hat{A}_t$ in it splits into two terms due to linearity of expectation, one with the $Q_t$ function and another with the baseline. The second term is zero because the baseline drives the expectation to zero, which I derived in my previous post in the finite-horizon case. (I’m not totally sure how to do this in the infinite-horizon case, due to technicalities involving infinity.)
The first term is unfortunately a little more complicated. Let me use the finite horizon for simplicity so that I can easily write out the definition. They argue in the proof that:
Most of this proceeds by definitions of expectations and then “pushing” integrals into their appropriate locations. Unfortunately, I am unable to figure out how they did step (i). Specifically, I don’t see how the integral over somehow “moves past” the term. Perhaps there is some trickery with the law of iterated expectation due to conditionals? If anyone else knows why and is willing to explain with detailed math somewhere, I would really appreciate it.
For now, I will assume this proposition to be true. It is useful because if we are given the form of estimator of the advantage, we can immediately tell if it is an unbiased advantage estimator.
Advantage Function Estimators
Now assume we have some function $V$ which attempts to approximate the true value function $V^{\pi,\gamma}$ (or $V^\pi$ in the undiscounted setting).

Note I: $V$ is not the true value function. It is only our estimate of it, so $V_\phi(s_t) \approx V^{\pi,\gamma}(s_t)$. I added in the $\phi$ subscript to indicate that we use a function, such as a neural network, to approximate the value. The weights of the neural network are entirely specified by $\phi$.

Note II: we also have our policy $\pi_\theta$ parameterized by parameters $\theta$, again typically a neural network. For now, assume that $\phi$ and $\theta$ are separate parameters; the authors mention some enticing future work where one can share parameters and jointly optimize. The combination of $\pi_\theta$ and $V_\phi$ with a policy estimator and a value function estimator is known as the actor-critic model with the policy as the actor and the value function as the critic. (I don’t know why it’s called a “critic” because the value function acts more like an “assistant”.)
Using $V$, we can derive a class of advantage function estimators as follows:
$$\begin{aligned}
\hat{A}_t^{(1)} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\
&\;\;\vdots \\
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)
\end{aligned}$$
These take on the form of temporal difference estimators where we first estimate the sum of discounted rewards and then we subtract the value function estimate of it. If $V = V^{\pi,\gamma}$, meaning that $V$ is exact, then all of the above are unbiased estimates for the advantage function. In practice, this will not be the case, since we are not given the value function.
The tradeoff here is that the estimators $\hat{A}_t^{(k)}$ with small $k$ have low variance but high bias, whereas those with large $k$ have low bias but high variance. Why? I think of it based on the number of terms. With small $k$, we have fewer terms to sum over (which means low variance). However, the bias is relatively large because it does not make use of the extra “exact” information in the rewards beyond the first $k$. Here’s another way to think of it as emphasized in the paper: $V(s_t)$ is constant among the estimator class, so it does not affect the relative bias or variance among the estimators: differences arise entirely due to the $k$-step returns.
One might wonder, as I originally did, how to make use of the $k$-step returns in practice. In Q-learning, we have to update the parameters (or the “table”) after each current reward, right? The key is to let the agent run for $k$ steps, and then update the parameters based on the returns. The reason why we update parameters “immediately” in ordinary Q-learning is simply due to the definition of Q-learning. With longer returns, we have to keep the Q-values fixed until the agent has explored more. This is also emphasized in the A3C paper from DeepMind, where they talk about $n$-step Q-learning.
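To make the $k$-step estimators concrete, here is a minimal NumPy sketch for a single recorded trajectory. It assumes a finite trajectory with `rewards[i]` holding $r_i$ and `values[i]` holding the value estimate $V(s_i)$ (so `values` has one more entry than `rewards`); the function name and the toy numbers are hypothetical.

```python
import numpy as np


def k_step_advantage(rewards, values, t, k, gamma):
    """k-step advantage estimate at time t.

    Sums k discounted rewards starting at t, bootstraps with the value
    estimate at s_{t+k}, and subtracts the value estimate at s_t.
    """
    assert 1 <= k and t + k <= len(rewards), "need k recorded rewards from time t"
    discounts = gamma ** np.arange(k)
    k_step_return = np.dot(discounts, rewards[t:t + k])
    return k_step_return + gamma**k * values[t + k] - values[t]


# Toy trajectory with three transitions (made-up numbers).
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, 0.2, 0.0])  # V(s_0), ..., V(s_3)
gamma = 0.9

a1 = k_step_advantage(rewards, values, t=0, k=1, gamma=gamma)
a2 = k_step_advantage(rewards, values, t=0, k=2, gamma=gamma)
a3 = k_step_advantage(rewards, values, t=0, k=3, gamma=gamma)
print(a1, a2, a3)
```

With $k = 1$ the estimate leans almost entirely on the value function, while with $k = 3$ it uses every recorded reward, which is the bias-variance tradeoff described above.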
The Generalized Advantage Estimator
It might not be so clear which of these estimators above is the most useful. How can we compute the bias and variance?
It turns out that it’s better to use all of the estimators, in a clever way. First, define the temporal difference residual $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. Now, here’s how the Generalized Advantage Estimator $\hat{A}_t^{GAE(\gamma,\lambda)}$ is defined:
$$\hat{A}_t^{GAE(\gamma,\lambda)} = (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots \right) = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^V$$
To derive this, one simply expands the definitions and uses the geometric series formula. The result is interesting to interpret: the exponentially-decayed sum of residual terms.
The above describes the estimator $GAE(\gamma, \lambda)$ for $\lambda \in [0,1]$ where adjusting $\lambda$ adjusts the bias-variance tradeoff. We usually have $\mathrm{Var}(GAE(\gamma, 1)) > \mathrm{Var}(GAE(\gamma, 0))$ due to the number of terms in the summation (more terms usually means higher variance), but the bias relationship is reversed. The other parameter, $\gamma$, also adjusts the bias-variance tradeoff … but for the GAE analysis it seems like the $\lambda$ part is more important. Admittedly, it’s a bit confusing why we need to have both $\gamma$ and $\lambda$ (after all, we can absorb them into one constant, right?) but as you can see, the constants serve different roles in the GAE formula.
To make a long story short, we can put the GAE in the policy gradient estimate and we’ve got our biased estimate (unless $\lambda = 1$) of the discounted gradient, which again, is itself biased due to the discount. Will this work well in practice? Stay tuned …
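Here is a minimal NumPy sketch of how one might compute the GAE on a recorded trajectory. It relies on the backward recursion $\hat{A}_t = \delta_t^V + \gamma \lambda \hat{A}_{t+1}$, which is just the exponentially-decayed sum written recursively, and it truncates the infinite sum at the end of the trajectory. That truncation, the function names, and the toy numbers are my own assumptions rather than anything prescribed by the paper.

```python
import numpy as np


def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every recorded step."""
    return rewards + gamma * values[1:] - values[:-1]


def generalized_advantage_estimates(rewards, values, gamma, lam):
    """Truncated GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    deltas = td_residuals(rewards, values, gamma)
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Work backwards so each step reuses the already-accumulated tail.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory (made-up numbers); values has one extra entry for the final state.
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, 0.2, 0.0])
gamma = 0.9

gae_lam0 = generalized_advantage_estimates(rewards, values, gamma, lam=0.0)
gae_lam1 = generalized_advantage_estimates(rewards, values, gamma, lam=1.0)
print(gae_lam0, gae_lam1)
```

Setting `lam=0.0` recovers the one-step residuals, while `lam=1.0` recovers the discounted return minus the value baseline, which are the two extremes of the tradeoff.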
Reward Shaping Interpretation
Reward shaping originated from a 1999 ICML paper, and refers to the technique of transforming the original reward function $r$ into a new one $\tilde{r}$ via the following transformation with an arbitrary real-valued function $\Phi$ on the state space:
$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$$
Amazingly, it was shown that despite how $\Phi$ is arbitrary, the reward shaping transformation results in the same optimal policy and optimal policy gradient, at least when the objective is to maximize discounted rewards $\sum_{t=0}^\infty \gamma^t r(s_t, a_t, s_{t+1})$. I am not sure whether the same is true with the undiscounted case as they have here, but it seems like it should since we can set $\gamma = 1$.
The more important benefit for their purposes, it seems, is that this reward shaping leaves the advantage function invariant for any policy. The word “invariant” here means that if we computed the advantage function $A^{\pi,\gamma}$ for a policy $\pi$ and a discount factor $\gamma$ in some MDP, the transformed MDP would have some advantage function $\tilde{A}^{\pi,\gamma}$, but we would have $\tilde{A}^{\pi,\gamma} = A^{\pi,\gamma}$ (nice!). This follows because if we consider the discounted sum of rewards starting at state $s_t$ in the transformed MDP, we get
$$\sum_{l=0}^\infty \gamma^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^\infty \gamma^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) - \Phi(s_t)$$
“Hitting” the above values with expectations (as Michael I. Jordan would say it) and substituting appropriate values results in the desired equality.
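The telescoping behind this invariance can be checked numerically. The sketch below is only an illustration: the helper names, the rewards, and the $\Phi$ values are made up, and it uses a finite trajectory, so a leftover $\gamma^T \Phi(s_T)$ term appears that would vanish in the infinite-horizon case with bounded $\Phi$.

```python
def discounted_sum(xs, gamma):
    """sum over l of gamma^l * xs[l]."""
    return sum((gamma ** l) * x for l, x in enumerate(xs))


def shaped_rewards(rewards, phi, gamma):
    """r~_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t); `phi` lists Phi(s_0), ..., Phi(s_T)."""
    return [r + gamma * phi[t + 1] - phi[t] for t, r in enumerate(rewards)]


gamma = 0.95
rewards = [1.0, -2.0, 0.5, 3.0]       # made-up rewards r_0, ..., r_3
phi = [0.7, -1.2, 4.0, 0.3, 2.5]      # arbitrary Phi(s_0), ..., Phi(s_4)

original = discounted_sum(rewards, gamma)
shaped = discounted_sum(shaped_rewards(rewards, phi, gamma), gamma)
# Telescoping leaves only the first and last Phi terms.
expected = original - phi[0] + gamma ** len(rewards) * phi[-1]
```

The only state-dependent offset that survives is $-\Phi(s_t)$, which does not depend on the action taken, so it cancels when the value function is subtracted from the Q-function.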
The connection between reward shaping and the GAE is the following: suppose we are trying to find a good policy gradient estimate for the transformed MDP. If we try to maximize the $(\gamma\lambda)$-discounted sum of (transformed) rewards and set $\Phi = V$, we get precisely the GAE! With $V$ here, we have $\tilde{r}(s_t, a_t, s_{t+1}) = \delta_t^V$, the residual term defined earlier.
To analyze the tradeoffs with $\gamma$ and $\lambda$, they use a response function:

$$\chi(l; s_t, a_t) = \mathbb{E}[r_{t+l} \mid s_t, a_t] - \mathbb{E}[r_{t+l} \mid s_t]$$
Why is this important? They state it clearly:
The response function lets us quantify the temporal credit assignment problem: long range dependencies between actions and rewards correspond to nonzero values of the response function for $l \gg 0$.
These “long-range dependencies” are the most challenging part of the credit assignment problem. Then here’s the kicker: they argue that if $\Phi = V^{\pi,\gamma}$, then the transformed rewards are such that $\mathbb{E}[\tilde{r}_{t+l} \mid s_t, a_t] = \mathbb{E}[\tilde{r}_{t+l} \mid s_t] = 0$ for $l > 0$. Thus, long-range rewards have to induce an immediate response! I’m admittedly not totally sure if I understand this, and it seems odd that we only want the response function to be nonzero at the current time (I mean, some rewards have to be merely a few steps in the future, right?). I will take another look at this section if I have time.
Value Function Estimation
In order to be able to use the GAE in our policy gradient algorithm (again, this means computing gradients and shifting the weights of the policy to maximize an objective), we need some value function parameterized by a neural network. This is part of the actor-critic framework, where the “critic” provides the value function estimate.
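Let $\hat{V}_t = \sum_{l=0}^\infty \gamma^l r_{t+l}$ be the discounted sum of rewards. The authors propose the following optimization procedure to find the best weights $\phi$:

$$\begin{aligned} \underset{\phi}{\text{minimize}} \quad & \sum_{n=1}^N \|V_\phi(s_n) - \hat{V}_n\|^2 \\ \text{subject to} \quad & \frac{1}{N} \sum_{n=1}^N \frac{\|V_\phi(s_n) - V_{\phi_{\rm old}}(s_n)\|^2}{2\sigma^2} \le \epsilon \end{aligned}$$

where each iteration, $\phi_{\rm old}$ is the parameter vector before the update, and

$$\sigma^2 = \frac{1}{N} \sum_{n=1}^N \|V_{\phi_{\rm old}}(s_n) - \hat{V}_n\|^2$$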
This is a constrained optimization problem to find the best weights for the value function. The constraint reminds me of Trust Region Policy Optimization, because it limits the amount that $\phi$ can change from one update to another. The advantages of a “trust region” method are that the weights don’t change too much and that they don’t overfit to the current batch. (Updates are done in batch mode, which is standard nowadays.)
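To make the two quantities concrete, here is a small sketch that evaluates them for a candidate set of value predictions. It assumes the objective is the summed squared error against the discounted-return targets and that the constraint is the mean squared change in predictions, scaled by $2\sigma^2$, where $\sigma^2$ is the old predictions' mean squared error. The function names, the numbers, and the threshold `epsilon` are hypothetical, and no actual optimization is performed.

```python
def discounted_returns(rewards, gamma):
    """V_hat_t = sum over l of gamma^l * r_{t+l}, truncated at the end of the trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_fit_terms(new_preds, old_preds, targets):
    """Return (objective, constraint value) for candidate predictions V_phi(s_n)."""
    n = len(targets)
    objective = sum((v - y) ** 2 for v, y in zip(new_preds, targets))
    sigma_sq = sum((v - y) ** 2 for v, y in zip(old_preds, targets)) / n
    change = sum((v - w) ** 2 for v, w in zip(new_preds, old_preds))
    constraint = change / (n * 2.0 * sigma_sq)
    return objective, constraint


targets = discounted_returns([1.0, 0.0, 2.0], gamma=0.9)
old_preds = [2.0, 1.5, 1.0]    # hypothetical V_{phi_old}(s_n)
new_preds = [2.3, 1.7, 1.6]    # hypothetical candidate V_phi(s_n)
objective, constraint = value_fit_terms(new_preds, old_preds, targets)
epsilon = 0.1                  # hypothetical trust-region size
within_trust_region = constraint <= epsilon
```

A real implementation would search over $\phi$ for the candidate with the smallest objective among those that satisfy the constraint; the sketch only shows how one candidate is scored.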

Note I: unfortunately, the authors don’t use this optimization procedure exactly. They use a conjugate gradient method to approximate it. But think of the optimization procedure here since it’s easier to understand and is “ideal.”

Note II: remember that this is not the update to the policy $\pi_\theta$. That update requires an entirely separate optimization procedure. Don’t get confused between the two. Both the policy and the value functions can be implemented as neural networks, and in fact, that’s what the authors do. They actually have the same architecture, with the exception of the output layer since the value only needs a scalar, whereas the policy needs a higher-dimensional output vector.
Putting it All Together
It’s nice to understand each of the components above, but how do we combine them into an actual algorithm? Here’s a rough description of their proposed actor-critic algorithm, each iteration:

Simulate the current policy to collect data.

Compute the Bellman residuals $\delta_t^V$.

Compute the advantage function estimate $\hat{A}_t$.

Update the policy’s weights, $\theta_{i+1}$, with a TRPO update.

Update the critic’s weights, $\phi_{i+1}$, with a trust-region update.
As usual, here are a few of my overly-detailed comments (sorry again):

Note I: Yes, there are trust region methods for both the value function update and the policy function update. This is one of their contributions. (To be clear, the notion of a “GAE” isn’t entirely their contribution.) The value and policy are also both neural networks with the same architecture except for the output since they have different outputs. Honestly, it seems like we should always be thinking about trust region methods whenever we have some optimization to do.

Note II: If you’re confused by the role of the two networks, repeat this to yourself: the policy network is for determining actions, and the value network is for improving the performance of the gradient update (which is used to improve the actual policy by pointing the gradient in the correct direction!).
They present some impressive experimental benchmarks using this actor-critic algorithm. I don’t have too much experience with MuJoCo so I can’t intuitively think about the results that much. (I’m also surprised that MuJoCo isn’t free and requires payment; it must be by far the best physics simulator for reinforcement learning, otherwise people wouldn’t be using it.)
Concluding Thoughts
I didn’t understand the implications of this paper when I read it for the first time (maybe more than a year ago!) but it’s becoming clearer now. They present and analyze a specific kind of estimator, the GAE, which has a bias-variance “knob” with the $\lambda$ (and $\gamma$, technically). By adjusting the knob, it might be possible to get low-variance, low-bias estimates, which would drastically improve the sample efficiency of policy gradient methods. They also present a way to estimate the value function using a trust region method. With these components, they are able to achieve high performance on challenging reinforcement learning tasks with continuous control.
Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
As I stated in my last blog post, I am feverishly trying to read more research papers. One category of papers that seems to be coming up a lot recently is the one about policy gradients, a popular class of reinforcement learning algorithms which estimate a gradient for a function approximator. Thus, the purpose of this blog post is for me to explicitly write the mathematical foundations for policy gradients so that I can gain understanding. In turn, I hope some of my explanations will be useful to a broader audience of AI students.
Assumptions and Problem Statement
In any type of research domain, we always have to make some set of assumptions. (By “we”, I refer to the researchers who write papers on this.) With reinforcement learning and policy gradients, the assumptions usually mean the episodic setting where an agent engages in multiple trajectories in its environment. As an example, an agent could be playing a game of Pong, so one episode or trajectory consists of a full start-to-finish game.
We define a trajectory $\tau$ of length $T$ as

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$

where $s_0$ comes from the starting distribution of states, $a_i \sim \pi_\theta(a_i \mid s_i)$, and $s_i \sim P(s_i \mid s_{i-1}, a_{i-1})$ with $P$ the dynamics model (i.e. how the environment changes). We actually ignore the dynamics when optimizing, since all we care about is getting a good gradient signal for $\pi_\theta$ to make it better. If this isn’t clear now, it will be clear soon. Also, the reward can be computed from the states and actions, since it’s usually a function of $(s_i, a_i, s_{i+1})$, so it’s not technically needed in the trajectory.
What’s our goal here with policy gradients? Unlike algorithms such as DQN, which strive to find an excellent policy indirectly through Q-values, policy gradients perform a direct gradient update on a policy to change its parameters, which is what makes them so appealing. Formally, we have:

$$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

Note I: I put $\pi_\theta$ under the expectation. This means the rewards are computed from a trajectory which was generated under the policy $\pi_\theta$. We have to find “optimal” settings of $\theta$ to make this work.

Note II: we don’t need to optimize the expected sum of discounted rewards, though it’s the formulation I’m most used to. Alternatives include ignoring $\gamma$ by setting it to one, extending $T$ to infinity if the episodes are infinite-horizon, and so on.
The above raises the all-important question: how do we find the best $\theta$? If you’ve taken optimization classes before, you should know the answer already: perform gradient ascent on $\theta$, so we have $\theta \leftarrow \theta + \alpha \nabla_\theta f(\theta)$ where $f(\theta)$ is the function being optimized. Here, that’s the expected value of whatever sum of rewards formula we’re using.
Two Steps: Log-Derivative Trick and Determining Log Probability
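Before getting to the computation of the gradient, let’s first review two mathematical facts which will be used later, and which are also of independent interest. The first is the “log-derivative” trick, which tells us how to insert a log into an expectation when starting from $\nabla_\theta \mathbb{E}[f(x)]$. Specifically, we have:

$$\begin{aligned} \nabla_\theta \mathbb{E}[f(x)] &= \nabla_\theta \int p_\theta(x) f(x)\, dx \\ &= \int \frac{p_\theta(x)}{p_\theta(x)} \nabla_\theta p_\theta(x) f(x)\, dx \\ &= \int p_\theta(x) \nabla_\theta \log p_\theta(x) f(x)\, dx \\ &= \mathbb{E}\left[f(x) \nabla_\theta \log p_\theta(x)\right] \end{aligned}$$

where $p_\theta$ is the density of $x$. Most of these steps should be straightforward. The main technical detail to worry about is exchanging the gradient with the integral. I have never been comfortable in knowing when we are allowed to do this or not, but since everyone else does this, I will follow them.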
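As a sanity check on the trick, here is a small Monte Carlo sketch for a case where the answer is known in closed form. It assumes $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f(x)] = \theta^2 + 1$ and the true gradient is $2\theta$; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import random


def score_function_gradient(theta, f, num_samples=200_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] as the mean of f(x) * d/dtheta log p(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta  # derivative of log N(x; theta, 1) with respect to theta
        total += f(x) * score
    return total / num_samples


theta = 1.0
estimate = score_function_gradient(theta, lambda x: x * x)
exact = 2.0 * theta  # because E[x^2] = theta^2 + 1
```

The estimate only needs samples of $x$ and the gradient of the log density, never the gradient of $f$ itself, which is what makes the trick useful when $f$ is a black-box reward.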
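Another technical detail we will need is the gradient of the log probability of a trajectory since we will later switch from $x$ above to a trajectory $\tau$. The computation of $\nabla_\theta \log p_\theta(\tau)$ proceeds as follows:

$$\begin{aligned} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta \log \left( \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t) \right) \\ &= \nabla_\theta \left[ \log \mu(s_0) + \sum_{t=0}^{T-1} \left( \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right) \right] \\ &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \end{aligned}$$

The probability of $\tau$ decomposes into a chain of probabilities by the Markov Decision Process assumption, whereby the next action only depends on the current state, and the next state only depends on the current state and action. To be explicit, we use the functions that we already defined: $\pi_\theta$ and $P$ for the policy and dynamics, respectively. (Here, $\mu$ represents the starting state distribution.) We also observe that when taking gradients, the dynamics disappear, because neither $\mu$ nor $P$ depends on $\theta$!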
Computing the Raw Gradient
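Using the two tools above, we can now get back to our original goal, which was to compute the gradient of the expected sum of (discounted) rewards. Formally, let $R(\tau)$ be the reward function we want to optimize (i.e. maximize). Using the above two tricks, we obtain:

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \nabla_\theta \log p_\theta(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

In the above, the expectation is with respect to the policy function, so think of it as $\tau \sim \pi_\theta$. In practice, we need trajectories to get an empirical expectation, which estimates this actual expectation.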
So that’s the gradient! Unfortunately, we’re not quite done yet. The naive way is to run the agent on a batch of episodes, get a set of trajectories (call it $\hat{\tau}$) and update $\theta$ with $\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \in \hat{\tau}}[R(\tau)]$ using the empirical expectation, but this will be too slow and unreliable due to high variance on the gradient estimates. After one batch, we may see a wide range of results: much better performance, equal performance, or worse performance. The high variance of these gradient estimates is precisely why there has been so much effort devoted to variance reduction techniques. (I should also add from personal research experience that variance reduction is certainly not limited to reinforcement learning; it also appears in many statistical projects which concern a bias-variance tradeoff.)
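To see the naive update in action, here is a toy sketch on a two-armed bandit, where each episode is a single step. Everything here is an assumption made for illustration: the policy picks arm 1 with probability $\sigma(\theta)$, arm 1 pays reward 1 and arm 0 pays 0, and the batch size and step size are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimate_gradient(theta, rng, batch_size):
    """Empirical mean of R(tau) * grad log pi(a) over a batch of one-step episodes."""
    p = sigmoid(theta)  # probability of pulling arm 1
    total = 0.0
    for _ in range(batch_size):
        action = 1 if rng.random() < p else 0
        reward = float(action)       # toy rewards: arm 1 pays 1, arm 0 pays 0
        grad_log_prob = action - p   # derivative of log pi(action) with respect to theta
        total += reward * grad_log_prob
    return total / batch_size


def train(num_iterations=200, batch_size=100, step_size=1.0, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(num_iterations):
        # Gradient ascent step using the empirical gradient estimate.
        theta += step_size * estimate_gradient(theta, rng, batch_size)
    return theta


final_theta = train()
prob_best_arm = sigmoid(final_theta)
```

This problem is easy enough that the naive estimator works; the variance issue shows up once trajectories are longer and rewards are noisier.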
How to Introduce a Baseline
The standard way to reduce the variance of the above gradient estimates is to insert a baseline function inside the expectation.
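For concreteness, assume $R(\tau) = \sum_{t=0}^{T-1} r_t$, so we have no discounted rewards. We can express the policy gradient in three equivalent, but perhaps non-intuitive ways:

$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &\overset{(i)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T-1} r_t\right) \left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\right] \\ &\overset{(ii)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \\ &\overset{(iii)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} r_{t'}\right)\right] \end{aligned}$$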
Comments:

Step (i) follows from plugging in our chosen into the policy gradient we previously derived.

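Step (ii) follows from first noting that $\nabla_\theta \mathbb{E}_{\tau}[r_{t'}] = \mathbb{E}_{\tau}\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$. The reason why this is true can be somewhat tricky to identify. I find it easy to think of just redefining $R(\tau)$ as $r_{t'}$ for some fixed timestep $t'$. Then, we do the exact same computation above to get the final result, as shown in the equation of the “Computing the Raw Gradient” section. The main difference now is that since we’re considering the reward at time $t'$, our trajectory under expectation stops at that time. More concretely, $\nabla_\theta \mathbb{E}_{(s_0, a_0, \ldots, s_T)}[r_{t'}] = \nabla_\theta \mathbb{E}_{(s_0, a_0, \ldots, s_{t'})}[r_{t'}]$. This is like “throwing away variables” when taking expectations due to “pushing values” through sums and summing over densities (which cancel out); I have another example later in this post which makes this explicit.
Next, we sum over both sides, for $t' = 0, 1, \ldots, T-1$. Assuming we can exchange the sum with the gradient, we get

$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \sum_{t'=0}^{T-1} \nabla_\theta \mathbb{E}_{\tau^{(t')}}[r_{t'}] \\ &= \sum_{t'=0}^{T-1} \mathbb{E}_{\tau^{(t')}}\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \end{aligned}$$

where $\tau^{(t')}$ indicates the trajectory up to time $t'$. (Full disclaimer: I’m not sure if this formalism with $\tau$ is needed, and I think most people would do this computation without worrying about the precise expectation details.)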

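Step (iii) follows from a nifty algebra trick. To simplify the subsequent notation, let $f_t := \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. In addition, ignore the expectation; we’ll only rearrange the inside here. With this substitution and setup, the sum inside the expectation from Step (ii) turns out to be

$$\begin{aligned} & r_0 f_0 \; + \\ & r_1 f_0 + r_1 f_1 \; + \\ & r_2 f_0 + r_2 f_1 + r_2 f_2 \; + \\ & \cdots \\ & r_{T-1} f_0 + r_{T-1} f_1 + \cdots + r_{T-1} f_{T-1} \end{aligned}$$

In other words, each $r_{t'}$ has its own row of $f$ values to which it gets distributed. Next, switch to the column view: instead of summing row-wise, sum column-wise. The first column is $f_0 \cdot \left(\sum_{t=0}^{T-1} r_t\right)$. The second is $f_1 \cdot \left(\sum_{t=1}^{T-1} r_t\right)$. And so on. Doing this means we get the desired formula after replacing $f_t$ with its real meaning and hitting the expression with an expectation.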
Note: it is very easy to make a typo with these. I checked my math carefully and cross-referenced it with references online (which themselves have typos). If any readers find a typo, please let me know.
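One cheap guard against typos is to check the row-versus-column rearrangement from Step (iii) on arbitrary numbers. In this sketch the lists `f` and `r` are made-up scalar stand-ins for the score terms and the rewards (the real score terms are vectors, but the identity holds component-wise), and the function names are hypothetical.

```python
def row_view(f, r):
    """Sum over t' of r[t'] * (f[0] + ... + f[t']), as in Step (ii)."""
    return sum(r[tp] * sum(f[: tp + 1]) for tp in range(len(r)))


def column_view(f, r):
    """Sum over t of f[t] * (r[t] + ... + r[T-1]), as in Step (iii)."""
    return sum(f[t] * sum(r[t:]) for t in range(len(f)))


f = [2, -1, 3, 5]    # stand-ins for the score terms f_t
r = [1, 4, -2, 7]    # stand-ins for the rewards r_t
rows = row_view(f, r)
columns = column_view(f, r)
```

Both views add up exactly the same products $r_{t'} f_t$ with $t \le t'$, just in a different order, so the two results agree.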
Using the above formulation, we finally introduce our baseline $b$, which is a function of $s_t$ (and not $s_{t'}$, I believe). We “insert” it inside the term in parentheses:

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$$

At first glance, it doesn’t seem like this will be helpful, and one might wonder if this would cause the gradient estimate to become biased. Fortunately, it turns out that this is not a problem. This was surprising to me, because all we know is that $b(s_t)$ is a function of $s_t$. However, this is a bit misleading because usually we want $b(s_t)$ to be the expected return starting at time $t$, which means it really “depends” on the subsequent time steps. For now, though, just think of it as a function of $s_t$.
Understanding the Baseline
In this section, I first go over why inserting $b$ above doesn’t make our gradient estimate biased. Next, I will go over why the baseline reduces variance of the gradient estimate. These two capture the best of both worlds: staying unbiased and reducing variance. In general, any time you have an unbiased estimate and it remains so after applying a variance reduction technique, then apply that variance reduction!
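First, let’s show that the gradient estimate is unbiased. We see that with the baseline, we can distribute and rearrange and get:

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} r_{t'}\right) - \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]$$

Due to linearity of expectation, all we need to show is that for any single time $t$, the gradient of $\log \pi_\theta(a_t \mid s_t)$ multiplied with $b(s_t)$ is zero. This is true because

$$\begin{aligned} \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] &= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:T}, a_{t:T-1}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]\right] \\ &= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b(s_t) \cdot \mathbb{E}_{s_{t+1:T}, a_{t:T-1}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] \\ &= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b(s_t) \cdot \mathbb{E}_{a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] \\ &= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b(s_t) \cdot 0\right] = 0 \end{aligned}$$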
Here are my usual overlydetailed comments (apologies in advance):

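Note I: this notation is similar to what I had before. The trajectory $s_0, a_0, \ldots, a_{T-1}, s_T$ is now represented as $s_{0:T}, a_{0:T-1}$. In addition, the expectation is split up, which is allowed. If this is confusing, think of the definition of the expectation with respect to at least two variables. We can write brackets in any appropriately enclosed location. Furthermore, we can “omit” the unnecessary variables in going from $\mathbb{E}_{s_{t+1:T}, a_{t:T-1}}$ to $\mathbb{E}_{a_t}$ (see expression above). Concretely, assuming we’re in discrete-land with actions in $\mathcal{A}$ and states in $\mathcal{S}$, this is because $\mathbb{E}_{s_{t+1:T}, a_{t:T-1}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$ evaluates to:

$$\begin{aligned} & \sum_{a_t \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \cdots \sum_{s_T \in \mathcal{S}} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t) \cdots P(s_T \mid s_{T-1}, a_{T-1}) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\ &= \sum_{a_t \in \mathcal{A}} \pi_\theta(a_t \mid s_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) \cdots \sum_{s_T \in \mathcal{S}} P(s_T \mid s_{T-1}, a_{T-1}) \\ &= \sum_{a_t \in \mathcal{A}} \pi_\theta(a_t \mid s_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \end{aligned}$$

This is true because of the definition of expectation, whereby we get the joint density over the entire trajectory, and then we can split it up like we did earlier with the gradient of the log probability computation. We can distribute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ all the way back to (but not beyond) the first sum over $a_t$. Pushing sums “further back” results in a bunch of sums over densities, each of which sums to one. The astute reader will notice that this is precisely what happens with variable elimination for graphical models. (The more technical reason why “pushing values back through sums” is allowed has to do with abstract algebra properties of the sum function, which is beyond the scope of this post.)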

Note II: This proof above also works with an infinite-time horizon. In Appendix B of the Generalized Advantage Estimation paper (arXiv link), the authors do so with a proof exactly matching the above, except that the time horizon is now infinite.

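Note III: About the expectation going to zero, that’s due to a well-known fact about score functions, which are precisely the gradient of log probabilities. We went over this in my STAT 210A class last fall. It’s again the log derivative trick. Observe that:

$$\mathbb{E}_{a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \int \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \pi_\theta(a_t \mid s_t)\, da_t = \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t = \nabla_\theta 1 = 0$$

where the penultimate step follows from how $\pi_\theta$ is a density. This follows for all time steps, and since the gradient of the log gets distributed for each $t$, it applies in all time steps. I switched to the continuous-land version for this, but it also applies with sums, as I just recently used in Note I.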
The above shows that introducing $b(s_t)$ doesn’t cause bias.
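For a discrete action space the key fact can be verified exactly by summing over actions. The sketch below assumes a softmax policy over three actions, parameterized directly by its logits; the logits, the baseline value, and the function names are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(logits))]


def expected_baseline_term(logits, baseline):
    """E_a[grad log pi(a | s) * b(s)], computed exactly by summing over actions."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for action, p in enumerate(probs):
        g = grad_log_prob(logits, action)
        for k in range(len(logits)):
            total[k] += p * g[k] * baseline
    return total


term = expected_baseline_term(logits=[0.3, -1.2, 2.0], baseline=17.5)
```

Every component of `term` is zero up to floating-point error, no matter how large the baseline value is, because the baseline only multiplies an expectation that is already zero.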
The last thing to cover is why its introduction reduces variance. I provide an approximate argument. To simplify notation, set $R_t(\tau) = \sum_{t'=t}^{T-1} r_{t'}$. We focus on the inside of the expectation (of the gradient estimate) to analyze the variance. The technical reason for this is that expectations are technically constant (and thus have variance zero) but in practice we have to approximate the expectations with trajectories, and that has high variance.
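The variance is approximated as:

$$\begin{aligned} {\rm Var}\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(R_t(\tau) - b(s_t)\right)\right) &\overset{(i)}{\approx} \sum_{t=0}^{T-1} \mathbb{E}\left[\left(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(R_t(\tau) - b(s_t)\right)\right)^2\right] \\ &\overset{(ii)}{\approx} \sum_{t=0}^{T-1} \mathbb{E}\left[\left(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)^2\right] \mathbb{E}\left[\left(R_t(\tau) - b(s_t)\right)^2\right] \end{aligned}$$

Approximation (i) is because we are approximating the variance of a sum by computing the sum of the variances. This is not true in general, but if we can assume this, then by the definition of the variance ${\rm Var}(X) := \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, we are left with the $\mathbb{E}[X^2]$ term since we already showed that introducing the baseline doesn’t cause bias (so the $(\mathbb{E}[X])^2$ term does not depend on the baseline). Approximation (ii) is because we assume independence among the values involved in the expectation, and thus we can factor the expectation.
Finally, we are left with the term $\mathbb{E}\left[\left(R_t(\tau) - b(s_t)\right)^2\right]$. If we are able to optimize our choice of $b(s_t)$, then this is a least squares problem, and it is well known that the optimal choice of $b(s_t)$ is to be the expected value of $R_t(\tau)$. In fact, that’s why policy gradient researchers usually want $b(s_t) \approx \mathbb{E}[R_t(\tau)]$ to approximate the expected return starting at time $t$, and that’s why in the vanilla policy gradient algorithm we have to refit the baseline estimate each time to make it as close to the expected return $\mathbb{E}[R_t(\tau)]$. At last, I understand.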
How accurate are these approximations in practice? My intuition is that they are actually fine, because recent advances in reinforcement learning algorithms, such as A3C, focus on the problem of breaking correlation among samples. If the correlation among samples is broken, then Approximation (i) becomes better, because I think the samples are no longer generated from the same trajectory.
Well, that’s my intuition. If anyone else has a better way of describing it, feel free to let me know in the comments or by email.
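One way to probe the intuition is a small simulation. The sketch below is a toy setup I am assuming purely for illustration: a stateless task where each action is Bernoulli with probability $\sigma(\theta)$, the reward equals the action, and the baseline is the exact expected reward-to-go. It compares three single-trajectory gradient estimators: full return, reward-to-go, and reward-to-go minus baseline.

```python
import math
import random


def sample_estimates(theta, horizon, rng):
    """One trajectory of a stateless toy task with r_t = a_t and a_t ~ Bernoulli(sigmoid(theta))."""
    p = 1.0 / (1.0 + math.exp(-theta))
    actions = [1 if rng.random() < p else 0 for _ in range(horizon)]
    scores = [a - p for a in actions]                        # d/dtheta log pi(a_t)
    rewards = [float(a) for a in actions]
    to_go = [sum(rewards[t:]) for t in range(horizon)]       # reward-to-go R_t
    baselines = [(horizon - t) * p for t in range(horizon)]  # exact expected reward-to-go
    full = sum(rewards) * sum(scores)
    causal = sum(s * g for s, g in zip(scores, to_go))
    with_baseline = sum(s * (g - b) for s, g, b in zip(scores, to_go, baselines))
    return full, causal, with_baseline


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
samples = [sample_estimates(theta=0.0, horizon=5, rng=rng) for _ in range(20_000)]
# stats[i] = (mean, variance) for the full-return, reward-to-go, and baseline estimators.
stats = [mean_and_variance(column) for column in zip(*samples)]
true_gradient = 5 * 0.5 * (1 - 0.5)  # T * sigmoid(theta) * (1 - sigmoid(theta))
```

In this toy problem all three sample means land near the true gradient, while the sample variance drops at each step, which is consistent with the argument above even though the argument itself is only approximate.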
Discount Factors
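So far, we have assumed we wanted to optimize the expected return, or the expected sum of rewards. However, if you’ve studied value iteration and policy iteration, you’ll remember that we usually use discount factors $\gamma \in (0,1]$. These empirically work well because the effect of an action many time steps later is likely to be negligible compared to other actions. Thus, it may not make sense to try and include raw distant rewards in our optimization problem. Thus, we often impose a discount as follows:

$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right] \\ &\approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t)\right)\right] \end{aligned}$$

where the $\gamma^{t'-t}$ serves as the discount, starting from 1, then getting smaller as time passes. (The first line above is a repeat of the policy gradient formula that I describe earlier.) As this is not exactly the “desired” gradient, this is an approximation, but it’s a reasonable one. This time, we now want our baseline to satisfy $b(s_t) \approx \mathbb{E}\left[r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1}\right]$.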
Advantage Functions
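In this final section, we rewrite the policy gradient formula in terms of the following value functions:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} r_t \;\Big|\; s_0 = s, a_0 = a\right], \qquad V^\pi(s) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} r_t \;\Big|\; s_0 = s\right]$$

Both of these should be familiar from basic AI; see the CS 188 notes from Berkeley if this is unclear. There are also discounted versions, which we can denote as $Q^{\pi,\gamma}(s,a)$ and $V^{\pi,\gamma}(s)$. In addition, we can also consider starting at any given time step, as in $Q^{\pi,\gamma}(s_t, a_t)$ which provides the expected (discounted) return assuming that at time $t$, our state-action pair is $(s_t, a_t)$.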
What might be new is the advantage function. For the undiscounted version, it is defined simply as:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

with a similar definition for the discounted version. Intuitively, the advantage tells us how much better action $a$ would be compared to the return based on an “average” action.
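The above definitions look very close to what we have in our policy gradient formula. In fact, we can claim the following:

$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T-1} r_{t'}\right)\right] \\ &\overset{(i)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^\pi(s_t, a_t)\right] \\ &\overset{(ii)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left(Q^\pi(s_t, a_t) - V^\pi(s_t)\right)\right] \\ &\overset{(iii)}{=} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A^\pi(s_t, a_t)\right] \end{aligned}$$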
In (i), we replace terms with their expectations. This is not generally valid to do, but it should work in this case. My guess is that if you start from the second line above (after the “(i)”) and plug in the definition of the expectation inside and rearrange terms, you can get the first line. However, I have not had the time to check this in detail and it takes a lot of space to write out the expectation fully. The conditioning with the value functions makes it a bit messy and thus the law of iterated expectation may be needed.
Also from line (i), we notice that the value function is a baseline, and hence we can add it there without changing the unbiasedness of the expectation. Then lines (ii) and (iii) are just for the advantage function. The implication of this formula is that the problem of policy gradients, in some sense, reduces to finding good estimates of the advantage function $A^\pi(s,a)$. That is precisely the topic of the paper Generalized Advantage Estimation.
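For a single state with a handful of actions, the definition is easy to compute by hand or in code. The sketch below uses made-up action probabilities and Q-values, and relies on the standard identity $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$; the function names are hypothetical.

```python
def state_value(policy_probs, q_values):
    """V(s) = sum over a of pi(a | s) * Q(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(policy_probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action a in one state."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]


policy_probs = [0.2, 0.5, 0.3]    # hypothetical pi(a | s)
q_values = [1.0, 3.0, -2.0]       # hypothetical Q(s, a)
adv = advantages(policy_probs, q_values)
mean_advantage = sum(p * a for p, a in zip(policy_probs, adv))
```

The advantage is positive for actions that beat the policy's average and negative for those that do worse, and its average under the policy is zero, which fits the reading of $V^\pi$ as a baseline.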
Concluding Remarks
Hopefully, this is a helpful, self-contained, bare-minimum introduction to policy gradients. I am trying to learn more about these algorithms, and going through the math details is helpful. This will also make it easier for me to understand the increasing number of research papers that are using this notation.
I also have to mention: I remember a few years ago during the first iteration of CS 294-112 that I had no idea how policy gradients worked. Now, I think I have become slightly more enlightened.
Acknowledgements: I thank John Schulman for making his notes publicly available.
Update April 19, 2017: I have code for vanilla policy gradients in my reinforcement learning GitHub repository.
Keeping Track of Research Articles: My Paper Notes Repository
The number of research papers in Artificial Intelligence has reached unmanageable proportions. Conferences such as ICML, NIPS, ICLR, and others are getting record numbers of paper submissions. In addition, tens of AI-related papers get uploaded to arXiv every weekday. With all these papers, it can be easy to feel lost and overwhelmed.
Like many researchers, I think I do not read enough research papers. This year, I resolved to change that, so I started an open-source GitHub repository called “Paper Notes” where I list papers that I’ve read along with my personal notes and summaries, if any. Papers without such notes are currently on my TODO radar.
After almost three months, I’m somewhat pleased with my reading progress. There are a healthy number of papers (plus notes) listed, arranged by subject matter and then further arranged by year. Not enough for me, but certainly not terrible either.
I was inspired to make this by seeing Denny Britz’s similar repository, along with Adrian Colyer’s blog. My repository is similar to Britz’s, though my aim is not to list all papers in Deep Learning, but to write down the ones that I actually plan to read at some point. (I see other repositories where people simply list Deep Learning papers without notes, which seems pretty pointless to me.) Colyer’s blog posts represent the kind of notes that I’d like to take for each paper, but I know that I can’t dedicate that much time to fine-tuning notes.
Why did I choose GitHub as the backend for my paper management, rather than something like Mendeley? First, GitHub is the default place where (pretty much) everyone in AI puts their open-source stuff: blogs, code, you name it. I’m already used to GitHub, so Mendeley would have to provide some serious benefit for me to switch over. I also don’t need to use advanced annotation and organizing materials, given that the top papers are easily searchable online (including their BibTeX references). In addition, by making my Paper Notes repository online, I can show this as evidence to others that I’m reading papers. Maybe this will even impress a few folks, and I say this only because everyone wants to be noticed in some way; that’s partly Colyer’s inspiration for his blog. So I think, on balance, it will be useful for me to keep updating this repository.
What Biracial People Know
There’s an opinion piece in the New York Times by Moises Velasquez-Manoff which talks about (drum roll please) biracial people. As he mentions:
Multiracials make up an estimated 7 percent of Americans, according to the Pew Research Center, and they’re predicted to grow to 20 percent by 2050.
Thus, I suspect that sometime in the next few decades, we will start talking about race in terms of precise racial percentages, such as “100 percent White” or in rarer cases, “25 percent White, 25 percent Asian, 25 percent Black, and 25 percent Native American.” (Incidentally, I’m not sure why the article uses “Biracial” when “Multiracial” would clearly have been a more appropriate term; it was likely due to the Barack Obama factor.)
The phrase “precise racial percentages” is misleading. Since all humans came from the same ancestor, at some point in history we must have been “one race.” For the sake of defining these racial percentages, we can take a date, say 4000 BC, when, presumably, the various races were sufficiently different, ensconced in their respective geographic regions, and when interracial marriages (or rape) were at a minimum. All humans alive at that point thus get a “100 percent [insert_race_here]” attached to them, and we do the arithmetic from there.
What usually happens in practice, though, is that we often default to describing one part of one race, particularly with people who are $x$ percent Black, where $x > 0$. This is a relic of the embarrassing “One Drop Rule” the United States had, but for now it’s probably (well, I hope) more for self-selecting racial identity.
Listing precise racial percentages would help us better identify people who are not easy to immediately peg in racial categories, which will increasingly become an issue as more and more multiracial people like me blur the lines between the races. In fact, this is already a problem for me even with single-race people: I sometimes cannot distinguish between Hispanics versus Whites. For instance, I thought Ted Cruz and Marco Rubio were 100 percent White.
Understanding race is also important when considering racial diversity and various ethical or sensitive questions over who should get “preferences.” For instance, I wonder if people label me as a “privileged white male” or if I get a pass for being biracial? Another question: for a job at a firm which has had a history of racial discrimination and is trying to make up for that, should the applicant who is 75 percent Black, 25 percent White, get a hair’s preference versus someone who is 25 percent Black and 75 percent White? Would this also apply if they actually have very similar skin color?
In other words, does one weigh more towards the looks or the precise percentages? I think the precise percentages method is the way schools, businesses, and government operate, despite how this isn’t the case in casual conversations.
Anyway, these are some of the thoughts that I have as we move towards a more racially diverse society, as multiracial people cannot have single-race children outside of adoption.
Back to the article: as one would expect, it discusses the benefits of racial diversity. I can agree with the following passage:
Social scientists find that homogeneous groups like [Donald Trump’s] cabinet can be less creative and insightful than diverse ones. They are more prone to groupthink and less likely to question faulty assumptions.
The caveat is that this assumes the people involved are equally qualified; a racially homogeneous (in whatever race), but extremely well-educated cabinet would be much better than a racially diverse cabinet where no one even finished high school. But controlling for quality, I can agree.
Diversity also benefits individuals, as the author notes. It is here where Mr. Velasquez-Manoff points out that Barack Obama was not just Black, but also biracial, which may have benefited his personal development. Multiracials make up a large fraction of the population in racially diverse Hawaii, where Obama was born (albeit, probably with more Asian-White overlap).
Yes, I agree that diversity is important for a variety of reasons. It is not easy, however:
It’s hard to know what to do about this except to acknowledge that diversity isn’t easy. It’s uncomfortable. It can make people feel threatened. “We promote diversity. We believe in diversity. But diversity is hard,” Sophie Trawalter, a psychologist at the University of Virginia, told me.
That very difficulty, though, may be why diversity is so good for us. “The pain associated with diversity can be thought of as the pain of exercise,” Katherine Phillips, a senior vice dean at Columbia Business School, writes. “You have to push yourself to grow your muscles.”
I cannot agree more.
Moving on:
Closer, more meaningful contact with those of other races may help assuage the underlying anxiety. Some years back, Dr. Gaither of Duke ran an intriguing study in which incoming white college students were paired with either same-race or different-race roommates. After four months, roommates who lived with different races had a more diverse group of friends and considered diversity more important, compared with those with same-race roommates. After six months, they were less anxious and more pleasant in interracial interactions.
Ouch, this felt like a blindsiding attack, and is definitely my main gripe with this article. In college, I had two roommates, both of whom have a different racial makeup than me. They both seemed to be relatively popular and had little difficulty mingling with a diverse group of students. Unfortunately, I certainly did not have a “diverse group of friends.” After all, if there was a prize for college for “least popular student” I would be a perennial contender. (As incredible as it may sound, in high school, where things were worse for me, I can remember a handful of people who might have been even lower on the social hierarchy.)
Well, I guess what I want to say is that, this attack notwithstanding, Mr. Velasquez-Manoff’s article brings up interesting and reasonably accurate points about biracial people. At the very least, he writes about concepts which are sometimes glossed over or underappreciated nowadays in our discussions about race.
Understanding Generative Adversarial Networks
Over the last few weeks, I’ve been learning more about some mysterious thing called Generative Adversarial Networks (GANs). GANs originally came out of a 2014 NIPS paper (read it here) and have had a remarkable impact on machine learning. I’m surprised that, until I was the TA for Berkeley’s Deep Learning class last semester, I had never heard of GANs before.^{1}
They certainly haven’t gone unnoticed in the machine learning community, though. Yann LeCun, one of the leaders in the Deep Learning community, had this to say about them during his Quora session on July 28, 2016:
The most important one, in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This is an idea that was originally proposed by Ian Goodfellow when he was a student with Yoshua Bengio at the University of Montreal (he since moved to Google Brain and recently to OpenAI).
This, and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.
If he says something like that about GANs, then I have no excuse for not learning about them. Thus, I read what is probably the highest-quality general overview available nowadays: Ian Goodfellow’s tutorial on arXiv, which he then presented in some form at NIPS 2016. This was really helpful for me, and I hope that later, I can write something like this (but on another topic in AI).
I won’t repeat what GANs can do here. Rather, I’m more interested in knowing how GANs are trained. Following now are some of the most important insights I gained from reading the tutorial:

Major Insight 1: the discriminator’s loss function is the cross entropy loss function. To understand this, let’s suppose we’re doing some binary classification with some trainable function that we wish to optimize, where indicates the estimated probability of some data point being in the first class. To get the predicted probability of being in the second class, we just do . The output of must therefore be constrained in , which is easy to do if we tack on a sigmoid layer at the end. Furthermore, let be the inputlabel pairing for training data points.
The cross entropy between two distributions, which we’ll call $p$ and $q$, is defined as

$$H(p,q) := -\sum_i p_i \log q_i$$

where $p$ and $q$ denote a “true” and an “empirical/estimated” distribution, respectively. Both are discrete distributions, hence we can sum over their individual components, denoted with $i$. (We would need to have an integral instead of a sum if they were continuous.)
To apply this loss function to the current binary classification task, we define the true distribution as $\mathbb{P}[y_i = 0] = 1$ if $y_i = 0$, or $\mathbb{P}[y_i = 1] = 1$ if $y_i = 1$. Putting this in 2-D vector form, it’s either $[1,0]$ or $[0,1]$. Intuitively, we know for sure which class this belongs to, so it makes sense for a probability distribution to be a “one-hot” vector.
Thus, for one data point $x_1$ and its label $y_1$, we get the following loss function, where here I’ve changed the input to be more precise:

$$H((x_1,y_1),D) = -y_1 \log D(x_1) - (1-y_1) \log (1-D(x_1))$$

Let’s look at the above function. Notice that only one of the two terms is going to be zero, depending on the value of $y_1$, which makes sense since it’s defining a distribution which is either $[0,1]$ or $[1,0]$. The other part is the estimated distribution from $D$. In both cases (the true and predicted distributions) we are encoding a 2-D distribution with one value, which lets us treat $D$ as a real-valued function.
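As a quick sanity check on the single-data-point case, here is a minimal Python sketch. The function names `cross_entropy` and `binary_cross_entropy` and the toy numbers are my own choices for illustration, not anything from the tutorial. It compares the general definition of cross entropy, applied to a one-hot true distribution, against the two-term binary form.

```python
import math

def cross_entropy(p, q):
    """General cross entropy H(p, q) = -sum_i p_i * log(q_i)."""
    # Components with p_i == 0 contribute nothing (convention: 0 * log q = 0).
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def binary_cross_entropy(y, d):
    """Two-term form for one data point.

    y is the label in {0, 1}; d = D(x) is the predicted probability
    that the point is in the first class (the class with y = 1).
    """
    return -y * math.log(d) - (1 - y) * math.log(1.0 - d)

# Toy values (made up): one example from each class.
examples = [(1, 0.9), (0, 0.2)]
for y, d in examples:
    one_hot = [y, 1 - y]        # true distribution over (first, second) class
    predicted = [d, 1.0 - d]    # estimated distribution produced by D
    general = cross_entropy(one_hot, predicted)
    two_term = binary_cross_entropy(y, d)
    print(y, d, general, two_term)
```

The two printed loss values should agree on every row, since the one-hot vector zeroes out exactly one of the two terms.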
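That was for one data point. Summing over the entire dataset of $N$ elements, we get something that looks like this:

$$H((x_i,y_i)_{i=1}^N,D) = -\sum_{i=1}^N y_i \log D(x_i) - \sum_{i=1}^N (1-y_i) \log (1-D(x_i))$$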
In the case of GANs, we can say a little more about what these terms mean. In particular, our $x_i$s only come from two sources: either $x_i \sim p_{\rm data}$, the true data distribution, or $x_i = G(z)$ where $z \sim p_{\rm generator}$, the generator’s distribution, based on some input code $z$. It might be, for example, $z \sim {\rm Unif}[0,1]$, but we will leave it unspecified.
In addition, we also want exactly half of the data to come from these two sources.
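To apply this to the sum above, we need to encode this probabilistically, so we replace the sums with expectations, the $y_i$ labels with $1/2$, and we can furthermore replace the $\log (1-D(x_i))$ term with $\log (1-D(G(z)))$ under some sampled code $z$ for the generator. We get

$$H((x_i,y_i)_{i=1}^\infty,D) = -\frac{1}{2} \mathbb{E}_{x \sim p_{\rm data}}[\log D(x)] - \frac{1}{2} \mathbb{E}_{z}[\log (1-D(G(z)))]$$

This is precisely the loss function for the discriminator, $J^{(D)}$.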
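To make the expectation form concrete, here is a rough Monte Carlo sketch in plain Python. Everything in it is a toy assumption of mine: the 1-D “data” drawn from a Gaussian, the uniform input code, and the hand-written functions `toy_generator`, `confident_discriminator`, and `blind_discriminator` (no neural networks, no training). It only estimates the two expectations by sample averages.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def discriminator_loss(D, G, real_samples, codes):
    """Sample-average estimate of
    -1/2 E_x[log D(x)] - 1/2 E_z[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in codes) / len(codes)
    return -0.5 * real_term - 0.5 * fake_term

def toy_generator(z):
    # Hypothetical generator: maps codes in [0, 1] to "fake" points in [-2, -1].
    return z - 2.0

def confident_discriminator(x):
    # Hypothetical discriminator: high probability of "real" for positive x.
    return sigmoid(4.0 * x)

def blind_discriminator(x):
    # A discriminator that cannot tell the two sources apart.
    return 0.5

rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]     # stand-in for p_data
codes = [rng.uniform(0.0, 1.0) for _ in range(1000)]  # stand-in input codes z

loss_confident = discriminator_loss(confident_discriminator, toy_generator, real, codes)
loss_blind = discriminator_loss(blind_discriminator, toy_generator, real, codes)
print("confident:", loss_confident)
print("blind:    ", loss_blind)
```

A discriminator that always outputs 0.5 gets a loss of $\log 2$, while one that separates the two sources well gets a loss close to zero, which is the behavior we would expect from a cross entropy.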

Major Insight 2: understanding how gradient saturation may or may not adversely affect training. Gradient saturation is a general problem when gradients are too small (i.e., close to zero) to perform any learning. See Stanford’s CS 231n notes on gradient saturation here for more details. In the context of GANs, gradient saturation may happen due to poor design of the generator’s loss function, so this “major insight” of mine is also based on understanding the tradeoffs among different loss functions for the generator. This design, incidentally, is where we can be creative; the discriminator needs the cross entropy loss function above since it has a very specific function (to discriminate between two classes) and the cross entropy is the “best” way of doing this.
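Using Goodfellow’s notation, we have the following candidates for the generator loss function, as discussed in the tutorial. The first is the minimax version:

$$J^{(G)} = -J^{(D)} = \frac{1}{2} \mathbb{E}_{x \sim p_{\rm data}}[\log D(x)] + \frac{1}{2} \mathbb{E}_{z}[\log (1-D(G(z)))]$$

The second is the heuristic, non-saturating version:

$$J^{(G)} = -\frac{1}{2} \mathbb{E}_{z}[\log D(G(z))]$$

Finally, the third is the maximum likelihood version, where $\sigma$ is the logistic sigmoid function:

$$J^{(G)} = -\frac{1}{2} \mathbb{E}_{z}\left[\exp\left(\sigma^{-1}(D(G(z)))\right)\right]$$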
What are the advantages and disadvantages of these generator loss functions? For the minimax version, it’s simple and allows for easier theoretical results, but in practice it’s not that useful, due to gradient saturation. As Goodfellow notes:
In the minimax game, the discriminator minimizes a cross-entropy, but the generator maximizes the same cross-entropy. This is unfortunate for the generator, because when the discriminator successfully rejects generator samples with high confidence, the generator’s gradient vanishes.
As suggested in Chapter 3 of Michael Nielsen’s excellent online book, the cross-entropy is a great loss function since it is designed in part to accelerate learning and to avoid gradient saturation right up until the classifier is correct (since we don’t want the gradient to move in that case!).
I’m not sure how to clearly describe this formally. For now, I will defer to Figure 16 in Goodfellow’s tutorial (see the top of this blog post), which nicely shows the value of $J^{(G)}$ as a function of the discriminator’s output, $D(G(z))$. Indeed, when the discriminator is winning, we’re at the left side of the graph, since the discriminator outputs the probability of the sample being from the true data distribution.
By the way, why is $J^{(G)} = -J^{(D)}$ only a function of $D(G(z))$ as suggested by the figure? What about the other term in $J^{(D)}$? Notice that of the two terms in the loss function, the first one is only a function of the discriminator’s parameters! The second part, which uses the $D(G(z))$ term, depends on both $D$ and $G$. Hence, for the purposes of performing gradient descent with respect to the parameters of $G$, only the second term in $J^{(D)}$ matters; the first term is a constant that disappears after taking derivatives $\nabla_{\theta_G}$.
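Spelled out for the minimax version, and writing $\theta_G$ for the generator’s parameters (my notation, not necessarily the tutorial’s), the argument looks roughly like this. It assumes that the gradient can be moved inside the expectation and that the distribution of the code $z$ does not depend on $\theta_G$:

$$\nabla_{\theta_G} J^{(G)} = \underbrace{\nabla_{\theta_G} \frac{1}{2} \mathbb{E}_{x \sim p_{\rm data}}[\log D(x)]}_{=\,0} + \nabla_{\theta_G} \frac{1}{2} \mathbb{E}_{z}[\log (1-D(G(z)))] = \frac{1}{2} \mathbb{E}_{z}\left[\nabla_{\theta_G} \log (1-D(G(z)))\right]$$

The first term vanishes because neither $D(x)$ nor the data distribution involves $\theta_G$, so the generator’s update only ever sees the discriminator’s response to generated samples.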
The figure makes it clear that the generator will have a hard time doing any sort of gradient update at the left portion of the graph, since the derivatives are close to zero. The problem is that the left portion of the graph represents the most common case when starting the game. The generator, after all, starts out with basically random parameters, so the discriminator can easily tell what is real and what is fake.^{2}
Let’s move on to the other two generator cost functions. The second one, the heuristically-motivated one, uses the idea that the generator’s gradient only depends on the second term in $J^{(D)}$. Instead of flipping the sign of $J^{(D)}$, they flip the target: changing $\log (1-D(G(z)))$ to $\log D(G(z))$. In other words, the “sign flipping” happens at a different part, so the generator still optimizes something “opposite” of the discriminator. From this reformulation, it appears from the figure above that $J^{(G)}$ now has desirable gradients in the left portion of the graph. Thus, the advantage here is that the generator gets a strong gradient signal so that it can quickly improve. The downside is that it’s not as easy to analyze, but who cares?
Finally, the maximum likelihood cost function has the advantage of being motivated based on maximum likelihood, which by itself has a lot of desirable properties. Unfortunately, the figure above shows that it has a flat slope in the left portion, though it seems to be slightly better than the minimax version since it decreases rapidly “sooner.” Though that might not be an “advantage,” since Goodfellow warns about high variance. That might be worth thinking about in more detail.
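To put rough numbers on the saturation story, here is a small Python sketch of the three per-sample generator costs. One caveat up front: I write each cost as a function of the discriminator’s pre-sigmoid logit $a = \sigma^{-1}(D(G(z)))$ and take slopes with respect to $a$. That is my own choice for illustration (it assumes the discriminator ends in a sigmoid layer), whereas the figure plots against $D(G(z))$ itself. The slopes are estimated with a simple central finite difference rather than derived by hand.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Per-sample generator costs, written in terms of the discriminator's
# logit a, so that D(G(z)) = sigmoid(a).
def minimax(a):
    return 0.5 * math.log(1.0 - sigmoid(a))

def non_saturating(a):
    return -0.5 * math.log(sigmoid(a))

def max_likelihood(a):
    # exp(sigmoid^{-1}(D(G(z)))) is simply exp(a).
    return -0.5 * math.exp(a)

def slope(f, a, h=1e-5):
    """Central finite-difference estimate of df/da."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

costs = {
    "minimax": minimax,
    "non-saturating": non_saturating,
    "maximum likelihood": max_likelihood,
}

# a = -5: the discriminator confidently rejects the sample (left of the graph).
# a = +5: the generator is fooling the discriminator (right of the graph).
for a in (-5.0, 0.0, 5.0):
    for name, f in costs.items():
        print(f"a = {a:+.1f}  {name:>18}: slope = {slope(f, a):+.4f}")

slopes_when_rejected = {name: slope(f, -5.0) for name, f in costs.items()}
```

Under these assumptions, the minimax and maximum likelihood costs have slopes near zero when the discriminator confidently rejects a sample, while the non-saturating cost keeps a slope of roughly one half in magnitude. That matches the qualitative picture in the figure.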
One last note: the function $J^{(G)}$, at least for the three cost functions here, does not depend directly on $x$ at all! That’s interesting … and in fact, Goodfellow argues that this makes GANs resistant to overfitting since the generator can’t copy from $x$.
I wish more tutorials like this existed for other AI concepts. I particularly enjoyed the three exercises and the solutions within this tutorial on GANs. I have more detailed notes here in my Paper Notes GitHub repository (I should have started this repository back in 2013). I highly recommend this tutorial to anyone wanting to know more about GANs.

Ian Goodfellow, the lead author on the GANs paper, was a guest lecturer for the class, where (obviously) he talked about GANs. ↩

Actually, the discriminator also starts out random, right? I think the discriminator has an easier job, though, since supervised learning is easier than generating realistic images (I mean, c’mon??) so perhaps the discriminator simply learns faster, and the generator has to spend a lot of time catching up. ↩
My Thoughts on CS 231n Being Forced To Take Down Videos
CS 231n: Convolutional Neural Networks for Visual Recognition is, in my biased opinion, one of the most important and thrilling courses offered by Stanford University. It has been taught twice so far and will appear again in the upcoming Spring quarter.
Due to its popularity, the course lectures for the second edition (Winter 2016) were videotaped and released online. This is not unusual among computer science graduate level courses due to high demand both inside and outside the university.
Unfortunately, as discussed in this rather large reddit discussion thread, Andrej Karpathy (one of the three instructors) was forced to pull down the lecture videos. He later clarified on his Twitter account that the reason had to do with the lack of captioning/subtitles in the lecture videos, which relates to a news topic I blogged about just over two years ago.
If you browse the reddit thread, you will see quite a lot of unhappy students. I just joined reddit and I was hoping to make a comment there, but reddit disables posting after six months. And after thinking about it, I thought it would make more sense to write some brief thoughts here instead.
To start, I should state upfront that I have no idea what happened beyond the stuff we can all read online. I don’t know who made the complaint, what the course staff did, etc.
Here’s my stance regarding class policies on watching videos:
If a class requires watching videos for whatever reason, then that video should have subtitles. Otherwise, no such action is necessary, though the course staff should attempt as much as is reasonable to have subtitles for all videos.
I remember two times when I had to face this problem of watching a non-subtitled video as a homework assignment: in an introductory Women’s, Gender, and Sexuality Studies course and an Africana Studies class about black athletes. For the former, we were assigned to watch a video about a transgender couple, and for the latter, the video was about black golfers. In both cases, the professors gave me copies of the movie (other students didn’t get these) and I watched one in a room by myself with the volume cranked up and the other one with another person who told me what was happening.
Is that ideal? Well, no. To (new) readers of this blog, welcome to the story of my life!
More seriously, was I supposed to do something about it? The professors didn’t make the videos, which were a tiny portion of the overall courses. I didn’t want to get all up in arms about this, so in both cases, I brought it up with them and they understood my situation (and apologized).
Admittedly, my brief stance above is incomplete and belies a vast gray area. What if students are given the option of doing one of two “required” assignments: watching a video or reading a book? That’s a gray area, though I would personally lean towards treating that as “required viewing” and thus “required subtitles.”
Class lecture videos also fall in a gray area. They are not required viewing, because students should attend lectures in person. Unfortunately, the lack of subtitles for these videos definitely puts deaf and hard of hearing students like myself at a disadvantage. I’ve lost count of the number of lectures that I wish I could have rewatched, but it is extraordinarily difficult for me to do so for non-subtitled videos.
Ultimately, however, as long as I can attend lectures and understand some of the material, I do not worry about whether lecture videos have subtitles. Just about every videotaped class that I have taken did not have subtitled lecture videos, with one exception: CS 267 from Spring 2016, after I had negotiated about it with Berkeley’s DSP.
Heck, the CS 294-129 class which I TAed for last semester (which is based on CS 231n!) had lecture videos. Were there captions? Nope.
Am I frustrated? Yes, but it’s understandable frustration due to the cost of adding subtitles. As a similar example, I’m frustrated at the identity politics practiced by the Democratic party, but it’s understandable frustration due to what political science instructs us to do, which is why I’m not planning to jump ship to another party.
Thus in my case, if I were a student in CS 231n, I would not be inclined to pressure the staff to pull the videos down. Again, this comes with the obvious caveat; I don’t know the situation and it might have been worse than I imagine.
As this discussion would imply, I don’t like pulling down lecture videos as “collateral damage”.^{1} I worry, however, if that’s in part because I’m too timid. Hypothetically and broadly speaking, if I have to take out my frustration (e.g. with lawsuits) on certain things, I don’t want to do this for something like lecture videos, which would make a number of folks angry at me, whether or not they openly express it.
On a more positive note … it turns out that, actually, the CS 231n lecture videos are online! I’m not sure why, but I’m happy. Using YouTube’s automatic captions, I watched one of the lectures and finally understood a concept that was critical and essential for me to know when I was writing my latest technical blog post.
Moreover, the automatic captions are getting better and better each year. They work pretty well on Andrej, who has a slight accent (Russian?). I dislike attending research talks if I don’t understand what’s going on, but given that so many are videotaped these days, whether at Berkeley or at conferences, maybe watching them offline is finally becoming a viable alternative.

In another case where lecture videos had to be removed, consider MIT’s Open Courseware and Professor Walter Lewin’s famous physics lectures. MIT removed the videos after it was found that Lewin had sexually harassed some of his students. Lewin’s harassment disgusted me, but I respectfully disagreed with MIT’s position about removing his videos, siding with then-MIT professor Scott Aaronson. In an infamous blog post, Professor Aaronson explained why he opposed the removal of the videos, which subsequently caused him to be the subject of a hate-rage/attack. Consequently, I am now a permanent reader of his blog. ↩
These Aren't Your Father's Hearing Aids
I am now wearing Oticon Dynamo hearing aids. The good news is that I’ve run many times with them and so far have not had issues with water resistance.
However, I wanted to bring up a striking point that really made me realize how remarkably our world has changed in the last few years.
A few months ago, when I was first fitted with the hearing aids, my audiologist set the default volume level to be “on target” for me. The hearing aid is designed to provide different amounts of power to people depending on their raw hearing level. There’s a volume control on it which goes from “1” (weak) to “4” (powerful), which I can easily adjust as I wish. The baseline setting is “3”, but this baseline is what audiologists adjust on a case-by-case basis. This means my “3” (and thus, my “1” and “4” settings) may be more powerful, less powerful, or the same compared to the respective settings for someone else.
When my audiologist first fit the hearing aids for me, I felt that my left hearing aid was too quiet and my right one too loud by default, so she modified the baselines.
She also, critically, gave me about a week to adjust to the hearing aids, and I was to report back on whether its strength was correctly set.
During that week, I wore the hearing aids, but I then decided that I was originally mistaken about both hearing aids, since I had to repeatedly increase the volume for the left one and decrease the volume for the right one.
I reported back to my audiologist and said that she was right all along, and that my baselines needed to be back to their default levels. She was able to corroborate my intuition by showing me, amazingly, how often I had adjusted the hearing aid volume level, and in which direction.
Hearing aids are, apparently, now fitted with these advanced sensors so they can track exactly how you adjust them (volume controls or otherwise).
The lesson is that just about everything nowadays consists of sensors, a point which is highlighted in Thomas L. Friedman’s excellent book Thank You for Being Late. It is also a characteristic of what computer scientists refer to as the “Internet of Things.”
Obviously, these certainly aren’t the hearing aids your father wore when he was young.
Academics Against Immigration Executive Order
I just signed a petition, Academics Against Immigration Executive Order, to oppose the Trump administration’s recent executive order. You can find the full text here along with the names of those who have signed up. (Graduate students are in the “Other Signatories” category and may take a while to update.) I like this petition because it clearly lists the names of people so as to avoid claims of duplication and/or bogus signatures for anonymous petitions. There are lots of academic superstars on the list, including (I’m proud to say) my current statistics professor Michael I. Jordan and my statistics professor William Fithian from last semester.
The petition lists three compelling reasons to oppose the order, but let me just chime in with some extra thoughts.
I understand the need to keep our country safe. But in order to do so, there has to be a correct tradeoff in terms of security versus profiling (for lack of a better word) and in terms of costs versus benefits.
On the spectrum of security, to one end are those who deny the existence of radical Islam and the impact of religion on terrorism. On the other end are those who would happily ban an entire religion and place the blame and burden on millions of law-abiding people fleeing oppression. This order is far too close to the second end.
In terms of costs and benefits, I find an analogy to policing useful. Mayors and police chiefs shouldn’t be assigning their police officers uniformly throughout cities. The police should be targeted in certain hotspots of crime as indicated by past trends. That’s the most logical and cost-effective way to crack down on crime.
Likewise, if we are serious about stopping radical Islamic terrorism, putting a blanket ban on Muslims is like the “uniform policing strategy” and will also cause additional problems since Muslims would (understandably!) feel unfairly targeted. For instance, Iran is already promising “proportional responses”. I also have to mention that the odds of being killed by a refugee terrorist are so low that the amount of anxiety towards them does not justify the cost.
By the way, I’m still waiting for when Saudi Arabia — the source of 15 out of 19 terrorists responsible for 9/11 — gets on the executive order list. I guess President Trump has business dealings there? (Needless to say, that’s why conflict of interest laws exist.)
I encourage American academics to take a look at this order and (hopefully) sign the petition. I also urge our Secretary of Defense, James Mattis, to talk to Trump and get him to rescind or substantially revise the order. While I didn’t state this publicly to anyone, I have more respect for Mattis than anyone else in the Trump cabinet, and hopefully that will remain the case.