# A Useful Matrix Inverse Equality for Ridge Regression

I was recently reading Stephen Tu’s blog and stumbled upon his matrix inverse equality post, which brought back some memories. Here’s the equality: let $X$ be an $n \times d$ matrix, let $\lambda > 0$, and let $I_k$ represent the $k \times k$ identity matrix. Then

$$(X^TX + \lambda I_d)^{-1}X^Ty = X^T(XX^T + \lambda I_n)^{-1}y.$$

Note that both sides are right-multiplied by $y$, so for the purposes of checking equality there’s no need to have it there; instead consider

$$(X^TX + \lambda I_d)^{-1}X^T = X^T(XX^T + \lambda I_n)^{-1}.$$

I was asked to prove this as part of a homework assignment in my CS 281A class with Professor Ben Recht a while back. Stephen Tu already has an excellent proof based on Singular Value Decomposition, but I’d like to present an alternative proof that seems even simpler to me. I also want to explain why this equality is important.

First, start with $\lambda X^T = \lambda X^T$. Then left-multiply the left hand side (LHS) by $I_d$ and right-multiply the right hand side (RHS) by $I_n$ to obtain

$$\lambda I_d X^T = \lambda X^T I_n.$$
Note that this doesn’t change the equality, because multiplying a matrix with the identity will not change it, regardless of the multiplication ordering. As a sanity check, the LHS is a $(d \times d)$ matrix times a $(d \times n)$ matrix, the RHS is a $(d \times n)$ matrix times an $(n \times n)$ matrix, and both result in a $(d \times n)$-dimensional matrix, so the dimensions match up. Next, add $X^TXX^T$ to both sides and get

$$X^TXX^T + \lambda I_d X^T = X^TXX^T + \lambda X^T I_n.$$
We now “distribute out” an $X^T$, though the particular $X^T$ to factor is different on the two sides. We obtain

$$(X^TX + \lambda I_d)X^T = X^T(XX^T + \lambda I_n).$$
Next, left-multiply both sides by $(X^TX + \lambda I_d)^{-1}$ and right-multiply both sides by $(XX^T + \lambda I_n)^{-1}$:

$$(X^TX + \lambda I_d)^{-1}(X^TX + \lambda I_d)X^T(XX^T + \lambda I_n)^{-1} = (X^TX + \lambda I_d)^{-1}X^T(XX^T + \lambda I_n)(XX^T + \lambda I_n)^{-1},$$

which simplifies to

$$X^T(XX^T + \lambda I_n)^{-1} = (X^TX + \lambda I_d)^{-1}X^T,$$

thus proving the identity.
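The identity is also easy to sanity-check numerically. Here’s a minimal NumPy sketch (the sizes, seed, and $\lambda$ are arbitrary choices of mine, not from the original problem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 8, 0.1          # arbitrary sizes; any n, d with lam > 0 work
X = rng.standard_normal((n, d))

# LHS: X^T (X X^T + lam I_n)^{-1}, a (d x n) matrix.
lhs = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
# RHS: (X^T X + lam I_d)^{-1} X^T, also (d x n).
rhs = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T

print(np.allclose(lhs, rhs))  # True
```

Of course a numerical check on one random matrix is no substitute for the proof, but it’s a quick way to catch a transposition mistake.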

Now, you can leave it like that and in some cases you’ll be OK. But strictly speaking, you should
show that the two inverses above *actually exist* before using them. When I was writing this up
for my homework, I said “the problem implies the inverses exist” because the inverse was already
written in the problem statement. In retrospect, however, that was a mistake. Fortunately, the TA
didn’t take off any points, though he marked “No, but w/e …”

To make things extra clear, we now show that the inverses exist. Consider $(X^TX + \lambda I_d)$. We claim its inverse exists because it is positive definite (and all positive definite matrices are invertible). This is straightforward because for any nonzero $z \in \mathbb{R}^d$, we have

$$z^T(X^TX + \lambda I_d)z = (Xz)^T(Xz) + \lambda z^Tz = \|Xz\|_2^2 + \lambda \|z\|_2^2 > 0,$$
where we use the $z^TX^TXz = \|Xz\|_2^2$ trick, which is *everywhere* in machine learning and
optimization. If you don’t believe me, take EE 227B as I did and you’ll never be able to forget
it. By the way, the above shows why we needed $\lambda > 0$, because $\|Xz\|_2^2$ is not
necessarily greater than zero; it could be zero! This is because $X^TX$ could be positive
*semi*-definite, but not positive *definite*.
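This failure mode is easy to exhibit: when $n < d$, the matrix $X^TX$ has rank at most $n$, so it has zero eigenvalues, yet adding $\lambda I_d$ shifts every eigenvalue up by $\lambda$. A small NumPy illustration (sizes and seed are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 3, 6, 0.5           # n < d, so X^T X (d x d) has rank at most n
X = rng.standard_normal((n, d))

G = X.T @ X
eig_G = np.linalg.eigvalsh(G)                 # smallest eigenvalues are (numerically) zero
eig_reg = np.linalg.eigvalsh(G + lam * np.eye(d))

print(eig_G.min())    # ~0: G is only positive *semi*-definite
print(eig_reg.min())  # ~lam: G + lam*I is positive definite, hence invertible
```

Adding $\lambda I_d$ shifts the whole spectrum by $\lambda$, which is exactly why the inverse exists.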

A similar proof applies for the other inverse, which uses $\|X^Tz\|_2^2 + \lambda\|z\|_2^2 > 0$ for nonzero $z \in \mathbb{R}^n$.

Having proved the equality, why do we care about it? The reason has to do with ridge regression and duality. The problem formulation for ridge regression is

$$w^* = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2,$$
where $y$ is the $n$-dimensional vector of targets and $\lambda > 0$. Taking the gradient we get

$$\nabla_w \left( \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \right) = 2X^T(Xw - y) + 2\lambda w.$$
Setting the gradient to zero, we find $w^*$ as

$$w^* = (X^TX + \lambda I_d)^{-1}X^Ty.$$
Gee, does this formulation of $w^*$ look familiar? Yep! The above is one way of finding $w^*$,
but another way is by invoking the equality to get what is known as the *dual form* of $w^*$:

$$w^* = X^T(XX^T + \lambda I_n)^{-1}y.$$
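Both forms are cheap to check against each other. A hedged NumPy sketch (sizes, seed, and $\lambda$ are arbitrary; I use `np.linalg.solve` rather than explicit inversion, which is the standard numerically-preferred choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 10, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal form: w* = (X^T X + lam I_d)^{-1} X^T y  -- solves a (d x d) system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# Dual form: w* = X^T (X X^T + lam I_n)^{-1} y    -- solves an (n x n) system.
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(w_primal, w_dual))  # True
```

Note the computational trade-off the identity buys you: the primal form solves a $d \times d$ system while the dual solves an $n \times n$ one, so you can pick whichever of $n$ and $d$ is smaller.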
Why do we like this new perspective? First, like any dual solution, it provides us with an
alternative method if the “other” way (technically called the *primal* solution) is harder to compute.
In this case, however, I don’t think that’s the main reason (because the computation is fairly
straightforward either way). I think we like this representation because $w^*$ is a linear
combination of the rows of $X$, with coefficients specified by the vector $(XX^T + \lambda I_n)^{-1}y \in \mathbb{R}^n$! The rows of $X$ are the elements in our data, so this gives
the intuitive interpretation of the optimal solution being some combination of the data elements.
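That interpretation can be made concrete in a few lines of NumPy: writing $\alpha = (XX^T + \lambda I_n)^{-1}y$, the dual solution $w^* = X^T\alpha$ is literally $\sum_i \alpha_i x_i$, one coefficient per data point (again, sizes and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 4, 7, 0.2
X = rng.standard_normal((n, d))   # each row of X is one data point x_i
y = rng.standard_normal(n)

alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)  # one coefficient per data point
w = X.T @ alpha                                        # dual-form w*

# w is literally sum_i alpha_i * x_i, a weighted sum of the data rows.
w_by_hand = sum(a * x for a, x in zip(alpha, X))
print(np.allclose(w, w_by_hand))  # True
```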