One of the major difficulties in understanding how neural networks work is due to the backpropagation algorithm. There are endless texts and online guides on backpropagation, but most are useless. I read several explanations of backpropagation when I learned about it from 2013 to 2014, but I never felt like I really understood it until I took/TA-ed the Deep Neural Networks class at Berkeley, based on the excellent Stanford CS 231n course.
The course notes from CS 231n include a tutorial on how to compute gradients for local nodes in computational graphs, which I think is key to understanding backpropagation. However, the notes are mostly for the one-dimensional case, and their main advice for extending gradient computation to the vector or matrix case is to keep track of dimensions. That’s perfectly fine, and in fact that was how I managed to get through the second CS 231n assignment.
But this felt unsatisfying.
For some of the harder gradient computations, I had to test several different ideas before passing the gradient checker, and sometimes I wasn’t even sure why my code worked! Thus, the purpose of this post is to make sure I deeply understand how gradient computation works.
Note: I’ve had this post in draft stage for a long time. However, I just found out that the notes from CS 231n have been updated with a guide from Erik Learned-Miller on taking matrix/vector derivatives. That’s worth checking out, but fortunately, the content I provide here is mostly distinct from his material.
The Basics: Computational Graphs in One Dimension
I won’t belabor the details on one-dimensional graphs since I assume the reader has read the corresponding Stanford CS 231n guide. Another nice post is from Chris Olah’s excellent blog. For my own benefit, I reviewed derivatives on computational graphs by going through the CS 231n example with sigmoids (but with the sigmoid computation spread out among finer-grained operations). You can see my hand-written computations in the following image. Sorry, I have absolutely no skill in getting this up quickly using tikz, Inkscape, or other visualization tactics/software. Feel free to right-click and open the image in a new tab. Warning: it’s big. (But I have to say, the iPhone7 plus makes really nice images. I remember the good old days when we had to take our cameras to CVS to get them developed…)
Another note: from the image, you can see that this is from the fourth lecture of CS 231n class. I watched that video on YouTube, which is excellent and of high-quality. Fortunately, there are also automatic captions which are highly accurate. (There’s an archived reddit thread discussing how Andrej Karpathy had to take down the videos due to a related lawsuit I blogged about earlier, but I can see them just fine. Did they get back up somehow? I’ll write more about this at a later date.)
When I was going through the math here, I came up with several rules to myself:
There’s a lot of notation that can get confusing, so for simplicity, I always denoted inputs as and outputs as , though in this example, we only have one output at each step. By doing this, I can view the s as a function of the terms, so the local gradient turns into and then I can substitute in terms of the inputs.
When doing backpropgation, I analyzed it node-by-node, and the boxes I drew in my image contain a number which indicates the order I evaluated them. (I skipped a few repeat blocks just as the lecture did.) Note that when filling in my boxes, I only used the node and any incoming/outgoing arrows. Also, the and keep getting repeated, i.e. the next step will have equal to whatever the was in the previous block.
Always remember that when we have arrows here, the part above the arrow contains the value of (respectively, ) and below the arrow we have (respectively ).
Hopefully this will be helpful to beginners using computational graphs.
Vector/Matrix/Tensor Derivatives, With Examples
Now let’s get to the big guns — vectors/matrices/tensors. Vectors are a special case of matrices, which are a special case of tensors, the most generalized -dimensional array. For this section, I will continue using the “partial derivative” notation to represent any derivative form (scalar, vector, or matrix).
Our first example will be with ReLUs, because that was covered a bit in the
CS 231n lecture. Let’s suppose , a 3-D column vector
representing some data from a hidden layer deep into the network. The ReLU
operation’s forward pass is extremely simple: , which can be
The backward pass is where things get tricky. The input is a 3-D vector, and so is the output! Hence, taking the derivative of the function means we have to consider the effect of every on every . The only way that’s possible is to use Jacobians. Using the example here, denoting the derivative as where is a function of , we have:
The most interesting part of this happens when we expand the Jacobian and see that we have a bunch of derivatives, but they all evaluate to zero on the off-diagonal. After all, the effect (i.e. derivative) of will be zero for the function . The diagonal term is only slightly more complicated: an indicator function (which evaluates to either 0 or 1) depending on the outcome of the ReLU. This means we have to cache the result of the forward pass, which easy to do in the CS 231n assignments.
How does this get combined into the incoming (i.e. “upstream”) gradient, which is a vector . We perform a matrix times vector operation with that and our Jacobian from above. Thus, the overall gradient we have for with respect to the loss function, which is what we wanted all along, is:
This is as simple as doing
mask * y_grad where
mask is a numpy array with 0s
and 1s depending on the value of the indicator functions, and
y_grad is the
upstream derivative/gradient. In other words, we can completely bypass the
Jacobian computation in our Python code! Another option is to use
0] = 0, where
x is the data that was passed in the forward pass (just before
ReLU was applied). In numpy, this will set all indices to which the condition
<= 0 is true to have zero value, precisely clearing out the gradients where we
need it cleared.
In practice, we tend to use mini-batches of data, so instead of a single , we have a matrix with
columns.1 Denote the th column as . Writing out
the full Jacobian is too cumbersome in this case, but to visualize it, think of
having and then stacking the two samples into a
six-dimensional vector. Do the same for the output . The
Jacobian turns out to again be a diagonal matrix, particularly because the
derivative of on the output is zero for .
Thus, we can again use a simple masking, element-wise multiply on the upstream
gradient to compute the local gradient of w.r.t. . In our code we
don’t have to do any “stacking/destacking”; we can actually use the exact same
mask * y_grad with both of these being 2-D numpy arrays (i.e. matrices)
rather than 1-D numpy arrays. The case is similar for larger minibatch sizes
Remark: this process of computing derivatives will be similar to other activation functions because they are elementwise operations.
Affine Layer (Fully Connected), Biases
Now let’s discuss a layer which isn’t elementwise: the fully connected layer operation . How do we compute gradients? To start, let’s consider one 3-D element so that our operation is
According to the chain rule, the local gradient with respect to is
Since we’re doing backpropagation, we can assume the upstream derivative is given, so we only need to compute the Jacobian. To do so, observe that
and a similar case happens for the second component. The off-diagonal terms are zero in the Jacobian since has no effect on for . Hence, the local derivative is
That’s pretty nice — all we need to do is copy the upstream derivative. No additional work necessary!
Now let’s get more realistic. How do we extend this when is a matrix? Let’s continue the same notation as we did in the ReLU case, so that our columns are for . Thus, we have:
Remark: crucially, notice that the elements of are repeated across columns.
How do we compute the local derivative? We can try writing out the derivative rule as we did before:
but the problem is that this isn’t matrix multiplication. Here, is a function from to , and to evaluate the derivative, it seems like we would need a 3-D matrix for full generality.
Fortunately, there’s an easier way with computational graphs. If you draw out the computational graph and create nodes for , you see that you have to write plus nodes to get the output, each of which takes in one of these terms along with adding . Then this produces . See my hand-drawn diagram:
This captures the key property of independence among the samples in . To compute the local gradients for , it therefore suffices to compute the local gradients for each of the and then add them together. (The rule in computational graphs is to add incoming derivatives, which can be verified by looking at trivial 1-D examples.) The gradient is
See what happened? This immediately reduced to the same case we had earlier,
with a Jacobian being multiplied by a upstream
derivative. All of the Jacobians turn out to be the identity, meaning that the
final derivative is the sum of the columns of
the original upstream derivative matrix . As a sanity check, this is a
-dimensional vector, as desired. In numpy, one can do this with
something similar to
np.sum(Y_grad), though you’ll probably need the
argument to make sure the sum is across the appropriate dimension.
Affine Layer (Fully Connected), Weight Matrix
Going from biases, which are represented by vectors, to weights, which are represented by matrices, brings some extra difficulty due to that extra dimension.
Let’s focus on the case with one sample . For the derivative with respect to , we can ignore since the multivariate chain rule states that the expression differentiated with respect to causes to disappear, just like in the scalar case.
The harder part is dealing with the chain rule for the expression, because we can’t write the expression “”. The function is a vector, and the variable we’re differentiating here is a matrix. Thus, we’d again need a 3-D like matrix to contain the derivatives.
Fortunately, there’s an easier way with the chain rule. We can still use the rule, except we have to sum over the intermediate components, as specified by the chain rule for higher dimensions; see the Wikipedia article for more details and justification. Our “intermediate component” here is the vector, which has two components. We therefore have:
We fortunately see that it simplifies to a simple matrix product! This seems to suggest the following rule: try to simplify any expressions to straightforward Jacobians, gradients, or scalar derivatives, and sum over as needed. Above, splitting the components of allowed us to utilize the derivative since is now a real-valued function, thus enabling straightforward gradient derivations. It also meant the upstream derivative could be analyzed component-by-component, making our lives easier.
A similar case holds for when we have multiple columns in . We would have another sum above, over the columns, but fortunately this can be re-written as matrix multiplication.
How do we compute the convolutional layer gradients? That’s pretty complicated so I’ll leave that as an exercise for the reader. For now.
In fact, is in general a tensor. Sophisticated software packages will generalize to be tensors. For example, we need to add another dimension to with image data since we’ll be using, say, data instead of data (or data in my trivial example here). However, for the sake of simplicity and intuition, I will deal with simple column vectors as samples within a matrix . ↩