I have passionately studied various machine learning and statistical concepts over the last few years. One thing I’ve learned from all this is that there are many mathematical “tricks” involved, whether or not they are explicitly stated. (In research papers, such tricks are often used without acknowledgment since it is assumed that anyone who can benefit from reading the paper has the mathematical maturity to fill in the details.) I thought it would be useful for me, and hopefully for a few interested readers, to catalogue a set of the common tricks here, and to see them applied in a few examples.

The following list, in alphabetical order, is a non-exhaustive set of tricks that I’ve seen:

  • Cauchy-Schwarz
  • Integrating Probabilities into Expectations
  • Introducing an Independent Copy
  • Jensen’s Inequality
  • Law of Iterated Expectation
  • Lipschitz Functions
  • Markov’s Inequality
  • Norm Properties
  • Series Expansions (e.g. Taylor’s)
  • Stirling’s Approximation
  • Symmetrization
  • Take a Derivative
  • Union Bound
  • Variational Representations

If the names are unclear or vague, the examples below should clarify. All the tricks are used except for the law of iterated expectation, i.e. . (No particular reason for that omission; it just turns out the exercises I’m interested in didn’t require it.)

Example 1: Maximum of (Not Necessarily Independent!) sub-Gaussians

I covered this problem in my last post here so I will not repeat the details. However, there are two extensions to that exercise which I thought would be worth noting.

First, To prove an upper bound for the random variable , it suffices to proceed as we did earlier in the non-absolute value case, but augment our sub-Gaussian variables with the set . It’s OK to do this because no independence assumptions are needed. Then it turns out that an upper bound can be derived as

This is the same as what we had earlier, except the “2” is now outside the square root. It’s quite intuitive.

Second, consider how we can prove the following bound:

We start by applying the standard technique of multiplying by , exponentiating and then applying Markov’s Inequality with our non-negative random variable :

where in (i) we used a bound previously determined in our bound on (it came out of an intermediate step), and then used the fact that the term in the exponential is a convex quadratic to find the minimizer value via differentiation in (ii).

At this point, to satisfy the desired inequality, we compare terms in the exponentials and claim that with ,

This will result in our desired bound. It therefore remains to prove this, but it reduces to checking that

and the left hand side is non-negative. Hence, the desired bound holds.

Tricks used:

  • Jensen’s Inequality
  • Markov’s Inequality
  • Take a Derivative
  • Union Bound

Comments: My earlier blog post (along with this one) shows what I mean when I say “take a derivative.” It happens when there is an upper bound on the right hand side and we have a free parameter (or ) which we can optimize to get the tighest possible bound. Often times, such a is explicitly introduced via Markov’s Inequality, as we have here. Just make sure to double check that when taking a derivative, you’re getting a minimum, not a maximum. In addition, Markov’s Inequality can only be applied to non-negative random variables, which is why we often have to exponentiate the terms inside a probability statement first.

Note the use of convexity of the exponential function. It is very common to see Jensen’s inequality applied with the exponential function. Always remember that !!

The procedure that I refer to as the “union bound” when I bound a maximum by a sum isn’t exactly the canonical way of doing it, since that typically involves probabilities, but it has a similar flavor. More formally, the union bound states that

for countable sets of events . When we define a set of events based on a maximum of certain variables, that’s the same as taking the union of the individual events.

On a final note, be on the look-out for applications of this type whenever a “maximum” operation is seen with something that resembles Gaussians. Sometimes this can be a bit subtle. For instance, it’s not uncommon to use a bound of the form above when dealing with , the expectation of the -norm of a standard Gaussian vector. In addition, when dealing with sparsity, often our “” or “” is actually something like or another combinatorics-style value. Seeing a “log” accompanied by a square root is a good clue and may help identify such cases.

Example 2: Bounded Random Variables are Sub-Gaussian

This example is really split into two parts. The first is as follows:

Prove that Rademacher random variables are sub-Gaussian with parameter .

The next is:

Prove that if is a zero-mean and has support , then is sub-Gaussian with parameter (at most) .

To prove the first part, let be a Rademacher random variable. For , we have

and thus the claim is satisfied by the definition of a sub-Gaussian random variable. In (i), we removed the expectation by using facts from Rademacher random variables, in (ii) we used the series expansions of the exponential function, in (iii) we simplified by removing the odd powers, in (iv) we used the clever trick that , and in (v) we again used the exponential function’s power series.

To prove the next part, observe that for any , we have

which shows by definition that is sub-Gaussian with parameter . In (i), we cleverly introduce an extra independent copy inside the exponent. It’s zero-mean, so we can insert it there without issues.1 In (ii), we use Jensen’s inequality, and note that we can do this with respect to just the random variable . (If this is confusing, just think of the expression as a function of and ignore the outer expectation.) In (iii) we apply a clever symmetrization trick by multiplying a Rademacher random variable to . The reason why we can do this is that is already symmetric about zero. Hence, inserting the Rademacher factor will maintain that symmetry (since Rademachers are only +1 or -1). In (iv), we applied the Rademacher sub-Gaussian bound with held fixed, and then in (v), we finally use the fact that .

Tricks used:

  • Introducing an Independent Copy
  • Jensen’s Inequality
  • Series Expansions (twice!!)
  • Symmetrization

Comments: The first part is a classic exercise in theoretical statistics, one which tests your ability to understand how to use the power series of exponential functions. The first part involved converting an exponential function to a power series, and then later doing the reverse. When I was doing this problem, I found it easiest to start by stating the conclusion — that we would have somehow — and then I worked backwards. Obviously, this only works when the problem gives us the solution!

The next part is also “classic” in the sense that it’s often how students (such as myself) are introduced to the symmetrization trick. The takeaway is that one should be on the lookout for anything that seems symmetric. Or, failing that, perhaps introduce symmetry by adding in an extra independent copy, as we did above. But make sure that your random variables are zero-mean!!

Example 3: Concentration Around Median and Means

Here’s the question:

Given a scalar random variable , suppose that there are positive constants such that

for all .

(a) Prove that

(b) Prove that for any median , we have

for all , where and .

To prove the first part, note that

where (i) follows from definition, (ii) follows from the “integrating probabilities into expectations” trick (which I will describe shortly), (iii) follows from the provided bound, and (iv) follows from standard calculus (note the multiplication of for mathematical convenience). This proves the first claim.

This second part requires some clever insights to get this to work. One way to start is by noting that:

and where the last inequality follows from the bound provided in the question. For us to be able to apply that bound, assume without loss of generality that , meaning that our term is positive and that we can increase the probability by inserting in absolute values. The above also shows that

We next tackle the core of the question. Starting from the left hand side of the desired bound, we get

where step (i) follows from adding zero, step (ii) follows from the Triangle Inequality, and (iii) follows from the provided bound based on the expectation. And yes, this is supposed to work only for when . The way to get around this is that we need to assume is greater than some quantity. After some algebra, it turns out a nice condition for us to enforce is that , which in turn will make . If , then the desired bound is attained because

a fact which can be derived through some algebra. Thus, the remainder of the proof boils down to checking the case that when , we have

and this is proved by analyzing roots of the quadratic and solving for .

Tricks used:

  • Integrating Probabilities into Expectations
  • Triangle Inequality

Comments: The trick “integrating probabilities into expectations” is one which I only recently learned about, though one can easily find it (along with the derivation) on the Wikipedia page for the expected values. In particular, note that for a positive real number , we have

and in the above, I use this trick with . It’s quite useful to convert between probabilities and expectations!

The other trick above is using the triangle inequality in a clever way. The key is to observe that when we have something like , if we increase the value of , then we increase that probability. This is another common trick used in proving various bounds.

Finally, the above also shows that when we have constants , it pays to be clever in how we assign those values. Then the remainder is some brute-force computation. I suppose it also helps to think about inserting s whenever we have a probability and a median.

Example 4: Upper Bounds for “Balls”

Consider the set

We often write the number of nonzeros in as like this even though is not technically a norm. This exercise consists of three parts:

(a) Show that where consists of all subsets of of size , and is a sub-vector of (of size ) indexed by those components. Note that by this definition, the cardinality of is equal to .

(b) Show that for any fixed subset of cardinality , we have .

(c) Establish the claim that .

To be clear on the notation, and refers to the Gaussian complexity of that set. It is, roughly speaking, a way to measure the “size” of a set.

To prove (a), let and let indicate the support of (i.e. where its nonzeros occur). For any (which we later treat to be sampled from , though the immediate analysis below does not require that fact) we have

where refers to the vector taking only the nonzero components from . The first inequality follows from Cauchy-Schwarz. In addition, by standard norm properties, taking results in the case when equality is attained. The claim thus follows. (There are some technical details needed regarding which of the maximums — over the set sizes or over the vector selection — should come first, but I don’t think the details are critical for me to know.)

For (b), we first claim that the function defined as is Lipschitz with respect to the Euclidean norm with Lipschitz constant . To see this, observe that when and are both -dimensional vectors, we have

where (i) follows from the reverse triangle inequality for normed spaces and (ii) follows from how the vector cannot have more nonzero terms than but must otherwise match it for indices lying in the subset .

The fact that is Lipschitz means that we can apply a theorem regarding tail bounds of Lipschitz functions of Gaussian variables. The function here doesn’t require its input to consist of vectors with IID standard Gaussian components, but we have to assume that the input is like that for the purposes of the theorem/bound to follow. More formally, for all we have

where (i) follows from how and thus we are just decreasing the threshold for the event (hence making it more likely) and (ii) follows from the theorem, which provides an in the denominator of the exponential, but here.

Finally, to prove (c), we first note that the previous part’s theorem guaranteed that the function is sub-Gaussian with parameter . Using this, we have

where (i) applies the bound for a maximum over sub-Gaussian random variables for all the sets (see Example 1 earlier), each with parameter , and (ii) applies an approximate bound due to Stirling’s approximation and ignores the constants of and . The careful reader will note that Example 1 required zero-mean sub-Gaussian random variables, but we can generally get around this by, I believe, subtracting away a mean and then re-adding later.

Tricks used:

  • Cauchy-Schwarz
  • Jensen’s Inequality
  • Lipschitz Functions
  • Norm Properties
  • Stirling’s Approximation
  • Triangle Inequality

Comments: This exercise involves a number of tricks. The fact that follows from how

due to Jensen’s inequality and how for . Fiddling with norms, expectations, and square roots is another common way to utilize Jensen’s inequality (in addition to using Jensen’s inequality with the exponential function, as explained earlier). Moreover, if you see norms in a probabilistic bound statement, you should immediately be thinking of the possibility of using a theorem related to Lipschitz functions.

The example also uses the (reverse!) triangle inequality for norms:

This can come up quite often and is the non-canonical way of viewing the triangle inequality, so watch out!

Finally, don’t forget the trick where we have . This comes from an application of Stirling’s approximation and is seen frequently in cases involving sparsity, where components are “selected” out of total. The maximum over a finite set should also provide a big hint regarding the use of a sub-Gaussian bound over maximums of (sub-Gaussian) variables.

Example 5: Gaussian Complexity of Ellipsoids

Recall that the space consists of all real sequences such that . Given a strictly positive sequence , consider the associated ellipse

(a) Prove that the Gaussian complexity satisfies the bounds

(b) For a given radius , consider the truncated set

Obtain upper and lower bounds on its Gaussian complexity that are tight up to universal constants independent of and .

To prove (a), we first start with the upper bound. Letting indicate a sequence of IID standard Gaussians , we have

where (i) follows from definition, (ii) follows from multiplying by one, (iii) follows from a clever application of the Cauchy-Schwarz inequality for sequences (or more generally, Holder’s Inequality), (iv) follows from the definition of , (v) follows from Jensen’s inequality, and (vi) follows from linearity of expectation and how .

We next prove the lower bound. First, we note a well-known result that where indicates the Rademacher complexity of the set. Thus, our task now boils down to showing that . Letting be IID Rademachers, we first begin by proving the upper bound

where (i) follows from definition, (ii) follows from the symmetric nature of the class of (meaning that WLOG we can pick for all ) and then multiplying by one, (iii), follows from Cauchy-Schwarz again, and (iv) follows from the provided bound in the definition of .

We’re not done yet: we actually need to show equality for this, or at the very least prove a lower bound instead of an upper bound. However, if one chooses the valid sequence such that , then equality is attained since we get

in one of our steps above. This proves part (a).

For part (b), we construct two ellipses, one that contains and one which is contained inside it. Let . Then we claim that the ellipse defined out of this sequence (i.e. treating “” as our “”) will be contained in . We moreover claim that the ellipse defined out of the sequence for all contains , i.e. . If this is true, it then follows that

because the definition of Gaussian complexity requires taking a maximum of over a set, and if the set grows larger via set containment, then the Gaussian complexity can only grow larger. In addition, the fact that the upper and lower bounds are related by a constant suggests that there should be extra lower and upper bounds utilizing universal constants independent of and .

Let us prove the two set inclusions previously described, as well as develop the desired upper and lower bounds. Suppose . Then we have

and

In both cases, the first inequality is because we can only decrease the value in the denominator.2 The last inequality follows by assumption of membership in . Both requirements for membership in are satisfied, and therefore, implies and thus the first set containment. Moving on to the second set containment, suppose . We have

where (i) follows from a “union bound”-style argument, which to be clear, happens because for every term in the summation, we have either or added to the summation (both positive quantities). Thus, to make the value larger, just add both terms! Step (ii) follows from the assumption of membership in . Thus, we conclude that , and we have proved that

The final step of this exercise is to develop a lower bound on the left hand side and an upper bound on the right hand side that are close up to universal constants. But we have reduced this to an instance of part (a)! Thus, we simply apply the lower bound for and the upper bound for and obtain

as our final bounds on . (Note that as a sanity check, the constant offset is less than one.) This proves part (b).

Tricks used:

  • Cauchy-Schwarz
  • Jensen’s Inequality
  • Union Bound

Comments: This exercise on the surface looks extremely challenging. How does one reason about multiple infinite sequences, which furthermore may or may not involve squared terms? I believe the key to tackling these problems is to understand how to apply Cauchy-Schwarz (or more generally, Holder’s Inequality) for infinite sequences. More precisely, Holder’s Inequality for sequences spaces states that

(It’s actually more general for this, since we can assume arbitrary positive powers and so long as , but the easiest case to understand is when .)

Holder’s Inequality is enormously helpful when dealing with sums (whether infinite or not), and especially when dealing with two sums if one does not square its terms, but the other one does.

Finally, again, think about Jensen’s inequality whenever we have expectations and a square root!

Example 6: Pairwise Incoherence

Given a matrix , suppose it has normalized columns ( for all ) and pairwise incoherence upper bounded as .

(a) Let be any subset of size . Show that there is a function such that as long as is sufficiently small, where is the matrix formed by extracting the columns of whose indices are in .

(b) Prove, from first principles, that satisfies the restricted nullspace property with respect to as long as .

To clarify, the pairwise incoherence of a matrix is defined as

where denotes the -th column of . Intuitively, it measures the correlation between any columns, though it subtracts an indicator at the end so that the maximal case does not always correspond to the case when . In addition, the matrix as defined in the problem looks like:

where the 1s in the diagonal are due to the assumption of having normalized columns.

First, we prove part (a). Starting from the variational representation of the minimum eigenvalue, we consider any possible with Euclidean norm one (and thus this analysis will apply for the minimizer which induces the minimum eigenvalue) and observe that

where (i) follows from the definition of a quadratic form (less formally, by matrix multiplication), (ii) follows from the assumption, (iii) follows from noting that

which in turn follows from the pairwise incoherence assumption that . Step (iv) follows from definition, and (v) follows from how for -dimensional vectors.

The above applies for any satisfactory . Putting together the pieces, we conclude that

which follows if is sufficiently small.

To prove the restricted nullspace property in (b), we first suppose that and . Define -dimensional vectors and which match components of for the indices within their respective sets or , and which are zero otherwise.3 Supposing that corresponds to the subset of indices of of the largest elements in absolute value, it suffices to show that , because then we can never violate this inequality (and thus the restricted nullspace property holds).

We first show a few facts which we then piece together to get the final result. The first is that

where (i) follows from the assumption that is in the kernel of , (ii) follows from how , (iii) follows from expanding the term, and (iv) follows from carefully noting that

where in the inequality, we have simply chosen as our , which can only make the bound worse. Then step (iv) follows immediately. Don’t forget that , because the latter involves a vector that (while longer) only has extra zeros. Incidentally, the above uses the variational representation for eigenvalues in a way that’s more convenient if we don’t want to restrict our vectors to have Euclidean norm one.

We conclude from the above that

Next, let us upper bound the RHS. We see that

where (i) follows from a little thought about how matrix multiplication and quadratic forms work. In particular, if we expanded out the LHS, we would get a sum with lots of terms that are zero since or would cancel them out. (To be clear, and .) Step (ii) follows from definition, step (iii) follows from the provided Pairwise Incoherence bound (note the need to multiply by ), and step (iv) follows from how

and thus it is clear that the product of the norms consists of the sum of all possible combination of indices with nonzero values.

The last thing we note is that from part (a), if we assumed that , then a lower bound on is . Putting the pieces together, we get the following three inequalities

We can provide a lower bound for the first term above. Using the fact that , we get . The final step is to tie the lower bound here with the upper bound from the set of three inequalities above. This results in

Under the same assumption earlier (that ) it follows directly that , as claimed. Whew!

Tricks used:

  • Cauchy-Schwarz
  • Norm Properties
  • Variational Representation (of eigenvalues)

Comments: Actually, for part (a), one can prove this more directly by using the Gershgorin Circle Theorem, a very useful Theorem with a surprisingly simple proof. But I chose this way above so that we can make use of the variational representation for eigenvalues. There are also variational representations for singular values.

The above uses a lot of norm properties. One example was the use of , which can be proved via Cauchy-Schwarz. The extension to this is that . These are quite handy. Another example, which is useful when dealing with specific subsets, is to understand how the and norms behave. Admittedly, getting all the steps right for part (b) takes a lot of hassle and attention to details, but it is certainly satisfying to see it work.

Closing Thoughts

I hope this post serves as a useful reference for me and to anyone else who might need to use one of these tricks to understand some machine learning and statistics-related math.


  1. One of my undergraduate mathematics professors, Steven J. Miller, would love this trick, as his two favorite tricks in mathematics are adding zero (along with, of course, multiplying by one). 

  2. Or “downstairs” as professor Michael I. Jordan often puts it (and obviously, “upstairs” for the numerator). 

  3. It can take some time and effort to visualize and process all this information. I find it helpful to draw some of these out with pencil and paper, and also to assume without loss of generality that corresponds to the first “block” of , and therefore corresponds to the second (and last) “block.” Please contact me if you spot typos; they’re really easy to make here.