As part of my CS 182/282A GSI duties, I have been reviewing the homework assignments and the CS 231n online notes. I don’t do the entire assignments, as that would take too much time away from research, but I do enough to advise the students. Incidentally, I hope they enjoy having me as a GSI! I try to pepper my discussion sections with lots of humor, and I also explain how I “think” through certain computations. I hope that is helpful.

One reason why I’m a huge fan of CS 231n is that, more than two years ago, I stated that their assignments (which CS 182/282A uses) are the best way to learn how backpropagation works, and I still stand by that comment today. In this post, I’d like to go through the backwards pass for a 1D convolution in some detail, and frame it in the lens of matrix multiplication.

My main reference for this will be the CS 231n notes. They are still highly relevant today. (Since they haven’t been updated in a while, I hope this doesn’t mean that Deep Learning as a field has started to “stabilize” …) With respect to the convolution operator, there are two main passages in the notes that interest me. The first explains how to implement convolutions as matrix multiplication:

Implementation as Matrix Multiplication. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: […]

This allows convolutions to utilize fast, highly-optimized matrix multiplication libraries.
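To make the quoted idea concrete in 1D, here is a minimal sketch (with made-up numbers, assuming filter length 3, stride 1, and zero-padding of 1 to match the example below): build a patch matrix whose rows are the local regions of the padded input, then do one matrix multiply.

```python
import numpy as np

def conv1d_as_matmul(x, w, pad=1):
    """1D convolution (really cross-correlation, as in CNNs) via an
    im2col-style patch matrix: each row is one local region of the input."""
    xp = np.pad(x, pad)                       # zero-pad the input
    n_out = len(xp) - len(w) + 1              # stride 1
    patches = np.stack([xp[i:i + len(w)] for i in range(n_out)])
    return patches @ w                        # one big matrix multiply

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
out = conv1d_as_matmul(x, w)
# Matches NumPy's cross-correlation of the padded input with the filter.
assert np.allclose(out, np.correlate(np.pad(x, 1), w, mode='valid'))
```

(In 2D the patch matrix is exactly what im2col produces; the idea is the same.)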

The second relevant passage from the 231n notes mentions how to do the backward pass for a convolution operation:

Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).

As usual, I like to understand these through a simple example. Consider a 1D convolution where we have input vector $\begin{bmatrix}x_1 & x_2 & x_3 & x_4\end{bmatrix}^T$ and three weight filters $w_1$, $w_2$, and $w_3$. With a stride of 1 and a padding of 1 on the input, we can implement the convolution operator using the following matrix-vector multiply:

$$\underbrace{\begin{bmatrix} w_1 & w_2 & w_3 & 0 & 0 & 0 \\ 0 & w_1 & w_2 & w_3 & 0 & 0 \\ 0 & 0 & w_1 & w_2 & w_3 & 0 \\ 0 & 0 & 0 & w_1 & w_2 & w_3 \\ \end{bmatrix}}_{W \in \mathbb{R}^{4\times 6}} \underbrace{\begin{bmatrix}0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ 0 \end{bmatrix}}_{\mathbf{x}' \in \mathbb{R}^6} = \underbrace{\begin{bmatrix} x_1w_2 + x_2w_3 \\ x_1w_1 + x_2w_2 + x_3w_3\\ x_2w_1 + x_3w_2 + x_4w_3\\ x_3w_1 + x_4w_2 \\ \end{bmatrix}}_{\mathbf{o} \in \mathbb{R}^4}$$

or more concisely, $W\mathbf{x}' = \mathbf{o}$ where $\mathbf{x}' \in \mathbb{R}^6$ is the padded version of $\mathbf{x} \in \mathbb{R}^4$.
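A quick numerical check of this matrix-vector view (with made-up filter and input values): build the exact $4 \times 6$ banded matrix $W$ above and confirm that $W\mathbf{x}'$ matches sliding the filter directly over the padded input.

```python
import numpy as np

w1, w2, w3 = 0.5, -1.0, 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])

# The 4x6 banded matrix W from the post: row i holds the filter
# (w1, w2, w3) shifted i positions to the right.
W = np.array([
    [w1, w2, w3, 0,  0,  0 ],
    [0,  w1, w2, w3, 0,  0 ],
    [0,  0,  w1, w2, w3, 0 ],
    [0,  0,  0,  w1, w2, w3],
])
x_pad = np.pad(x, 1)          # x' = (0, x1, x2, x3, x4, 0)
o = W @ x_pad

# Same result as sliding the filter directly over the padded input.
assert np.allclose(o, np.correlate(x_pad, [w1, w2, w3], mode='valid'))
```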

As an aside, you can consider what happens with a “transposed” version of the $W$ matrix. I won’t go through the details, but it’s possible to have the matrix-vector multiply be an operator that increases the dimension of the output. (Normally, convolutions decrease the spatial dimension(s) of the input, though they keep the depth consistent.) Justin Johnson calls these “transposed convolutions” for the reason that $W^T$ can be used to implement the operator. Incidentally, he will start as an assistant professor at the University of Michigan later this fall – congratulations to him!
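The dimension-increasing behavior is easy to verify with the same $W$ matrix (made-up filter values again): multiplying by $W^T$ maps $\mathbb{R}^4$ back up to $\mathbb{R}^6$, which is the sense in which transposed convolutions upsample.

```python
import numpy as np

w1, w2, w3 = 0.5, -1.0, 2.0
W = np.array([
    [w1, w2, w3, 0,  0,  0 ],
    [0,  w1, w2, w3, 0,  0 ],
    [0,  0,  w1, w2, w3, 0 ],
    [0,  0,  0,  w1, w2, w3],
])

v = np.array([1.0, 2.0, 3.0, 4.0])
up = W.T @ v                  # "transposed convolution": R^4 -> R^6
assert up.shape == (6,)       # output dimension increased from 4 to 6
```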

In the backwards pass with loss function $L$, let’s suppose we’re given some upstream gradient, so $\frac{\partial L}{\partial o_i}$ for all components in the output vector $\mathbf{o}$. How can we do the backwards pass for the weights and then the data?

Let’s go through the math. I will now assume the relevant vectors and their derivatives w.r.t. $L$ are row vectors (following the CS 182/282A notation); in the column-vector convention, we would simply flip the order of multiplication to be matrix-vector rather than vector-matrix.

We have:

$\underbrace{\frac{\partial L}{\partial \mathbf{x}}}_{1\times 4} = \underbrace{\frac{\partial L}{\partial \mathbf{o}}}_{1 \times 4} \underbrace{\frac{\partial \mathbf{o}}{\partial \mathbf{x}}}_{4\times 4} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} \frac{\partial o_1}{\partial x_1} & \frac{\partial o_1}{\partial x_2} & \frac{\partial o_1}{\partial x_3} & \frac{\partial o_1}{\partial x_4} \\ \frac{\partial o_2}{\partial x_1} & \frac{\partial o_2}{\partial x_2} & \frac{\partial o_2}{\partial x_3} & \frac{\partial o_2}{\partial x_4} \\ \frac{\partial o_3}{\partial x_1} & \frac{\partial o_3}{\partial x_2} & \frac{\partial o_3}{\partial x_3} & \frac{\partial o_3}{\partial x_4} \\ \frac{\partial o_4}{\partial x_1} & \frac{\partial o_4}{\partial x_2} & \frac{\partial o_4}{\partial x_3} & \frac{\partial o_4}{\partial x_4} \end{bmatrix}$

and

$\underbrace{\frac{\partial L}{\partial \mathbf{w}}}_{1\times 3} = \underbrace{\frac{\partial L}{\partial \mathbf{o}}}_{1 \times 4} \underbrace{\frac{\partial \mathbf{o}}{\partial \mathbf{w}}}_{4 \times 3} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} \frac{\partial o_1}{\partial w_1} & \frac{\partial o_1}{\partial w_2} & \frac{\partial o_1}{\partial w_3} \\ \frac{\partial o_2}{\partial w_1} & \frac{\partial o_2}{\partial w_2} & \frac{\partial o_2}{\partial w_3} \\ \frac{\partial o_3}{\partial w_1} & \frac{\partial o_3}{\partial w_2} & \frac{\partial o_3}{\partial w_3} \\ \frac{\partial o_4}{\partial w_1} & \frac{\partial o_4}{\partial w_2} & \frac{\partial o_4}{\partial w_3} \end{bmatrix}.$

Recall that in our example, $\mathbf{x} \in \mathbb{R}^4$ and $\mathbf{w} \in \mathbb{R}^3$. Their gradients must have these same shapes, since the loss is a scalar.

Notice that every element of the Jacobians above comes from differentiating a simple dot product. For example:

$\frac{\partial o_1}{\partial x_1} = \frac{\partial}{\partial x_1} \Big( x_1w_2 + x_2w_3 \Big) = w_2.$

By repeating this process, we end up with:

$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} w_2 & w_3 & 0 & 0 \\ w_1 & w_2 & w_3 & 0 \\ 0 & w_1 & w_2 & w_3 \\ 0 & 0 & w_1 & w_2 \end{bmatrix}$

and

$\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} 0 & x_1 & x_2 \\ x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \\ x_3 & x_4 & 0 \end{bmatrix}.$

Upon seeing the two above operations, it should now be clear why these are viewed as convolution operations. In particular, they’re convolutions where the previous incoming (or “upstream” in CS231n verbiage) gradients act as the input, and the Jacobian encodes the convolution operator’s “filters.” If it helps, feel free to transpose the whole thing above to get it in line with my matrix-vector multiply near the beginning of the post.
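As a sanity check, here is a sketch (with made-up numbers) that computes both vector–Jacobian products exactly as written above and verifies them against finite differences, using the loss $L = \frac{\partial L}{\partial \mathbf{o}} \cdot \mathbf{o}$ so that the upstream gradient is the fixed vector $\mathbf{g}$.

```python
import numpy as np

def conv1d(x, w):
    """Forward pass from the post: pad by 1, stride 1, filter length 3."""
    return np.correlate(np.pad(x, 1), w, mode='valid')

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
g = np.array([1.0, -2.0, 0.5, 3.0])   # upstream gradient dL/do (made up)

w1, w2, w3 = w
x1, x2, x3, x4 = x
# Jacobians do/dx and do/dw, exactly as written out in the post.
J_x = np.array([[w2, w3, 0,  0 ],
                [w1, w2, w3, 0 ],
                [0,  w1, w2, w3],
                [0,  0,  w1, w2]])
J_w = np.array([[0,  x1, x2],
                [x1, x2, x3],
                [x2, x3, x4],
                [x3, x4, 0 ]])
dL_dx, dL_dw = g @ J_x, g @ J_w

# Finite-difference check with L(x, w) = g . conv1d(x, w).
eps = 1e-6
num_dx = np.array([(g @ conv1d(x + eps*e, w) - g @ conv1d(x - eps*e, w))
                   / (2*eps) for e in np.eye(4)])
num_dw = np.array([(g @ conv1d(x, w + eps*e) - g @ conv1d(x, w - eps*e))
                   / (2*eps) for e in np.eye(3)])
assert np.allclose(dL_dx, num_dx) and np.allclose(dL_dw, num_dw)
```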

Now, why does 231n say that filters are “spatially flipped”? It’s easiest to work this out with pencil and paper by looking at how the math plays out for each component in the convolution. Let’s look at the computation for $\frac{\partial L}{\partial \mathbf{x}}$. Imagine the vector $\frac{\partial L}{\partial \mathbf{o}}$ as input to the convolution. The vector-matrix multiply above will result in a filter from $\mathbf{w}$ sliding through from left-to-right (i.e., starting from $\frac{\partial L}{\partial o_1}$), but with the filter in reverse order: $(w_3,w_2,w_1)$. Technically, the input needs to be padded by 1, and the stride for the filter is 1.
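This identity is easy to check numerically (with made-up filter and upstream-gradient values): sliding the reversed filter over the padded upstream gradient reproduces $\frac{\partial L}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{x}}$, and since `np.convolve` flips its kernel internally, `mode='same'` gives the same answer directly.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])        # filter (w1, w2, w3)
g = np.array([1.0, -2.0, 0.5, 3.0])   # upstream gradient dL/do (made up)

w1, w2, w3 = w
# The do/dx Jacobian from the post, so dL/dx = g @ J_x.
J_x = np.array([[w2, w3, 0,  0 ],
                [w1, w2, w3, 0 ],
                [0,  w1, w2, w3],
                [0,  0,  w1, w2]])
dL_dx = g @ J_x

# Same thing: pad g by 1 and slide the *reversed* filter (w3, w2, w1).
flipped = np.correlate(np.pad(g, 1), w[::-1], mode='valid')
assert np.allclose(dL_dx, flipped)
# np.convolve flips its kernel internally, so 'same' mode also matches.
assert np.allclose(dL_dx, np.convolve(g, w, mode='same'))
```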

For $\frac{\partial L}{\partial \mathbf{w}}$, the filter now comes from $\mathbf{x}$. This time, while the filter itself is in the same order, as in $(x_1,x_2,x_3,x_4)$, it is applied in reverse, from right-to-left on the input vector, so the first computation is for $\frac{\partial L}{\partial o_4}$. I assume that’s what the notes mean by “spatially flipped,” though it feels a bit misleading in this case. Perhaps I’m missing something? Again, note that we pad the input by 1 and use a stride of 1 for the filter.
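This one can also be checked numerically (same made-up numbers as before): $\frac{\partial L}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{w}}$ equals the correlation of the padded input $\mathbf{x}'$ with the upstream gradient acting as the filter.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, -2.0, 0.5, 3.0])   # upstream gradient dL/do (made up)

x1, x2, x3, x4 = x
# The do/dw Jacobian from the post, so dL/dw = g @ J_w.
J_w = np.array([[0,  x1, x2],
                [x1, x2, x3],
                [x2, x3, x4],
                [x3, x4, 0 ]])
dL_dw = g @ J_w

# Same thing: slide g (as the filter) over the padded input x'.
assert np.allclose(dL_dw, np.correlate(np.pad(x, 1), g, mode='valid'))
```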

In theory, generalizing to 2D is, as Professor John Canny has repeatedly said both to me individually and to the CS 182/282A class more broadly, just a matter of tracking and then regrouping indices. In practice, unless you’re able to track indices as well as he can, it’s very error-prone. Be very careful. Ideally this is what the CS 182/282A students did to implement the backwards pass, rather than resorting to looking up a solution online from someone’s blog post.