Mathematical Tricks Commonly Used in Machine Learning and Statistics
I have passionately studied various machine learning and statistical concepts over the last few years. One thing I’ve learned from all this is that there are many mathematical “tricks” involved, whether or not they are explicitly stated. (In research papers, such tricks are often used without acknowledgment since it is assumed that anyone who can benefit from reading the paper has the mathematical maturity to fill in the details.) I thought it would be useful for me, and hopefully for a few interested readers, to catalogue a set of the common tricks here, and to see them applied in a few examples.
The following list, in alphabetical order, is a nonexhaustive set of tricks that I’ve seen:
 Cauchy-Schwarz
 Integrating Probabilities into Expectations
 Introducing an Independent Copy
 Jensen’s Inequality
 Law of Iterated Expectation
 Lipschitz Functions
 Markov’s Inequality
 Norm Properties
 Series Expansions (e.g. Taylor’s)
 Stirling’s Approximation
 Symmetrization
 Take a Derivative
 Union Bound
 Variational Representations
If the names are unclear or vague, the examples below should clarify. All the tricks are used except for the law of iterated expectation, i.e. $\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big]$. (No particular reason for that omission; it just turns out the exercises I’m interested in didn’t require it.)
Example 1: Maximum of (Not Necessarily Independent!) Sub-Gaussians
I covered this problem in my last post here so I will not repeat the details. However, there are two extensions to that exercise which I thought would be worth noting.
First, to prove an upper bound for the random variable $\max_{i=1,\ldots,n} |X_i|$, it suffices to proceed as we did earlier in the non-absolute-value case, but to augment our sub-Gaussian variables with the set $\{-X_1, \ldots, -X_n\}$. It’s OK to do this because no independence assumptions are needed. Then it turns out that an upper bound can be derived as

$$\mathbb{E}\Big[\max_{i=1,\ldots,n} |X_i|\Big] \le 2\sqrt{\sigma^2 \log n}.$$
This is the same as what we had earlier, except the “2” is now outside the square root. It’s quite intuitive.
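As a quick numerical sanity check (not a proof), the following sketch simulates the Gaussian special case and compares a Monte Carlo estimate of $\mathbb{E}[\max_i |X_i|]$ against the bound $2\sqrt{\sigma^2 \log n}$; the sample sizes and the seed are arbitrary choices of mine.

```python
import numpy as np

# Monte Carlo check of E[max_i |X_i|] <= 2*sqrt(sigma^2 * log n)
# for n i.i.d. N(0, sigma^2) variables (a special case of sub-Gaussians).
rng = np.random.default_rng(0)
n, sigma, trials = 1000, 2.0, 2000
samples = rng.normal(0.0, sigma, size=(trials, n))
empirical = np.abs(samples).max(axis=1).mean()   # estimate of E[max_i |X_i|]
bound = 2.0 * np.sqrt(sigma**2 * np.log(n))      # the claimed upper bound
print(empirical, bound)                          # empirical sits below the bound
```

Independence is used only to generate the sample; the bound itself, as noted above, needs no independence.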
Second, consider how we can prove the following bound:
We start by applying the standard technique of multiplying by $\lambda > 0$, exponentiating, and then applying Markov’s Inequality with the resulting nonnegative random variable:
where in (i) we used a bound previously determined in our earlier derivation (it came out of an intermediate step), and in (ii) we used the fact that the term in the exponential is a convex quadratic, so the minimizing value can be found via differentiation.
At this point, to satisfy the desired inequality, we compare terms in the exponentials and claim that
This will result in our desired bound. It therefore remains to prove this, but it reduces to checking that
and the left-hand side is nonnegative. Hence, the desired bound holds.
Tricks used:
 Jensen’s Inequality
 Markov’s Inequality
 Take a Derivative
 Union Bound
Comments: My earlier blog post (along with this one) shows what I mean when I say “take a derivative.” It happens when there is an upper bound on the right-hand side and we have a free parameter which we can optimize to get the tightest possible bound. Oftentimes, such a parameter is explicitly introduced via Markov’s Inequality, as we have here. Just make sure to double-check that when taking a derivative, you’re getting a minimum, not a maximum. In addition, Markov’s Inequality can only be applied to nonnegative random variables, which is why we often have to exponentiate the terms inside a probability statement first.
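The derivative step can be checked numerically. For a $\sigma$-sub-Gaussian $X$, the Chernoff/Markov argument leaves a bound of the form $e^{-\lambda t + \lambda^2\sigma^2/2}$ with a free $\lambda > 0$; differentiating the exponent gives the minimizer $\lambda^* = t/\sigma^2$ and the familiar $e^{-t^2/(2\sigma^2)}$. A grid search over $\lambda$ (with arbitrary $\sigma$ and $t$ of my choosing) agrees with the closed form:

```python
import numpy as np

# "Take a derivative": minimize exp(-lam*t + lam^2*sigma^2/2) over lam > 0.
# The calculus answer is lam* = t / sigma^2, giving exp(-t^2 / (2*sigma^2)).
sigma, t = 1.5, 2.0
lams = np.linspace(1e-3, 5.0, 100001)
exponents = -lams * t + 0.5 * lams**2 * sigma**2
grid_best = np.exp(exponents.min())            # numerical minimum over the grid
closed_form = np.exp(-t**2 / (2 * sigma**2))   # value at the critical point
print(grid_best, closed_form)                  # the two agree
```

Note that the exponent is a convex quadratic in $\lambda$, so the critical point really is a minimum.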
Note the use of convexity of the exponential function. It is very common to see Jensen’s inequality applied with the exponential function. Always remember that $e^{\mathbb{E}[X]} \le \mathbb{E}[e^{X}]$!!
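A tiny numerical illustration of this direction of Jensen’s inequality (the distribution and seed below are arbitrary):

```python
import numpy as np

# Jensen with the convex exponential: exp(E[X]) <= E[exp(X)].
# This holds for the empirical measure of any finite sample as well.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100000)
lhs = np.exp(x.mean())     # exponential of the average
rhs = np.exp(x).mean()     # average of the exponentials
print(lhs, rhs)            # lhs never exceeds rhs
```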
The procedure that I refer to as the “union bound” when I bound a maximum by a sum isn’t exactly the canonical way of doing it, since that typically involves probabilities, but it has a similar flavor. More formally, the union bound states that

$$\mathbb{P}\Big(\bigcup_{i} A_i\Big) \le \sum_{i} \mathbb{P}(A_i)$$

for countable sets of events $\{A_i\}$. When we define a set of events based on a maximum of certain variables, that’s the same as taking the union of the individual events.
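The union bound can be seen empirically with overlapping events; the thresholds below are arbitrary choices used purely for illustration:

```python
import numpy as np

# Union bound: P(A_1 ∪ ... ∪ A_k) <= sum_i P(A_i), checked on overlapping
# events A_i = {Z >= c_i} for a standard Gaussian Z.
rng = np.random.default_rng(2)
z = rng.normal(size=200000)
cutoffs = [0.5, 1.0, 1.5]
events = np.stack([z >= c for c in cutoffs])   # indicator array, one row per A_i
p_union = events.any(axis=0).mean()            # empirical P(union of the A_i)
p_sum = events.mean(axis=1).sum()              # sum of the individual P(A_i)
print(p_union, p_sum)                          # union probability is smaller
```

The inequality here holds exactly for any sample, since counting a point once (in the union) never exceeds counting it once per event it belongs to.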
On a final note, be on the lookout for applications of this type whenever a “maximum” operation is seen with something that resembles Gaussians. Sometimes this can be a bit subtle. For instance, it’s not uncommon to use a bound of the form above when dealing with the expectation of the $\infty$-norm of a standard Gaussian vector. In addition, when dealing with sparsity, often our “$n$” is actually something like $\binom{d}{s}$ or another combinatorics-style value. Seeing a “log” accompanied by a square root is a good clue and may help identify such cases.
Example 2: Bounded Random Variables are Sub-Gaussian
This example is really split into two parts. The first is as follows:
Prove that Rademacher random variables are sub-Gaussian with parameter $\sigma = 1$.
The next is:
Prove that if $X$ is zero-mean and has support $[a, b]$, then $X$ is sub-Gaussian with parameter (at most) $b - a$.
To prove the first part, let $\varepsilon$ be a Rademacher random variable. For any $\lambda \in \mathbb{R}$, we have

$$\mathbb{E}\big[e^{\lambda \varepsilon}\big] \overset{(i)}{=} \frac{1}{2}\big(e^{-\lambda} + e^{\lambda}\big) \overset{(ii)}{=} \frac{1}{2}\Big(\sum_{k=0}^{\infty} \frac{(-\lambda)^k}{k!} + \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\Big) \overset{(iii)}{=} \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{(2k)!} \overset{(iv)}{\le} \sum_{k=0}^{\infty} \frac{(\lambda^2/2)^k}{k!} \overset{(v)}{=} e^{\lambda^2/2}$$

and thus the claim is satisfied by the definition of a sub-Gaussian random variable. In (i), we removed the expectation by using facts about Rademacher random variables, in (ii) we used the series expansion of the exponential function, in (iii) we simplified by removing the odd powers, in (iv) we used the clever trick that $(2k)! \ge 2^k k!$, and in (v) we again used the exponential function’s power series.
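The conclusion of this chain can be spot-checked deterministically, since the Rademacher MGF is exactly $\cosh(\lambda)$:

```python
import numpy as np

# Rademacher sub-Gaussianity: E[exp(lam*eps)] = cosh(lam) <= exp(lam^2/2)
# for every real lam, i.e. the sub-Gaussian parameter is 1.
lams = np.linspace(-10, 10, 2001)
mgf = np.cosh(lams)            # exact MGF of a Rademacher variable
bound = np.exp(lams**2 / 2)    # the sub-Gaussian upper bound
ok = bool(np.all(mgf <= bound))
print(ok)
```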
To prove the next part, observe that for any $\lambda \in \mathbb{R}$, we have

$$\mathbb{E}_X\big[e^{\lambda X}\big] \overset{(i)}{=} \mathbb{E}_X\big[e^{\lambda (X - \mathbb{E}_{X'}[X'])}\big] \overset{(ii)}{\le} \mathbb{E}_{X, X'}\big[e^{\lambda (X - X')}\big] \overset{(iii)}{=} \mathbb{E}_{X, X', \varepsilon}\big[e^{\lambda \varepsilon (X - X')}\big] \overset{(iv)}{\le} \mathbb{E}_{X, X'}\Big[e^{\frac{\lambda^2 (X - X')^2}{2}}\Big] \overset{(v)}{\le} e^{\frac{\lambda^2 (b - a)^2}{2}}$$

which shows by definition that $X$ is sub-Gaussian with parameter $b - a$. In (i), we cleverly introduce an extra independent copy $X'$ inside the exponent. It’s zero-mean, so we can insert it there without issues.^{1} In (ii), we use Jensen’s inequality, and note that we can do this with respect to just the random variable $X'$. (If this is confusing, just think of the expression as a function of $X'$ and ignore the outer expectation.) In (iii) we apply a clever symmetrization trick by multiplying a Rademacher random variable $\varepsilon$ into $X - X'$. The reason why we can do this is that $X - X'$ is already symmetric about zero. Hence, inserting the Rademacher factor will maintain that symmetry (since Rademachers are only +1 or -1). In (iv), we applied the Rademacher sub-Gaussian bound with $X - X'$ held fixed, and then in (v), we finally used the fact that $|X - X'| \le b - a$.
Tricks used:
 Introducing an Independent Copy
 Jensen’s Inequality
 Series Expansions (twice!!)
 Symmetrization
Comments: The first part is a classic exercise in theoretical statistics, one which tests your ability to use the power series of the exponential function. The first part involved converting an exponential function to a power series, and then later doing the reverse. When I was doing this problem, I found it easiest to start by stating the conclusion (that we would somehow end up with $\mathbb{E}[e^{\lambda \varepsilon}] \le e^{\lambda^2/2}$) and then working backwards. Obviously, this only works when the problem gives us the solution!
The next part is also “classic” in the sense that it’s often how students (such as myself) are introduced to the symmetrization trick. The takeaway is that one should be on the lookout for anything that seems symmetric. Or, failing that, perhaps introduce symmetry by adding in an extra independent copy, as we did above. But make sure that your random variables are zero-mean!!
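The conclusion of the symmetrization argument can be checked by simulation. Here I use a recentered uniform distribution on $[a, b]$ as an arbitrary bounded zero-mean example and compare its estimated MGF against $e^{\lambda^2 (b-a)^2 / 2}$:

```python
import numpy as np

# Monte Carlo check: a zero-mean X supported on [a, b] satisfies
# E[exp(lam*X)] <= exp(lam^2 * (b - a)^2 / 2), i.e. it is (b-a)-sub-Gaussian.
rng = np.random.default_rng(3)
a, b = -1.0, 3.0
x = rng.uniform(a, b, size=400000)
x = x - x.mean()                              # enforce (empirical) zero mean
lams = np.linspace(-1.0, 1.0, 21)
mgf = np.array([np.exp(l * x).mean() for l in lams])   # estimated MGF
bound = np.exp(lams**2 * (b - a)**2 / 2)               # sub-Gaussian bound
holds = bool(np.all(mgf <= bound))
print(holds)
```

The bound has a lot of slack here, which matches the fact that $b - a$ is generally not the tightest possible sub-Gaussian parameter for a bounded variable.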
Example 3: Concentration Around Median and Means
Here’s the question:
Given a scalar random variable $X$, suppose that there are positive constants $c_1, c_2$ such that

$$\mathbb{P}\big[|X - \mathbb{E}[X]| \ge \delta\big] \le c_1 e^{-c_2 \delta^2}$$

for all $\delta \ge 0$.
(a) Prove that $\text{Var}(X) \le \frac{c_1}{c_2}$.
(b) Prove that for any median $m_X$, we have

$$\mathbb{P}\big[|X - m_X| \ge \delta\big] \le c_3 e^{-c_4 \delta^2}$$

for all $\delta \ge 0$, where $c_3 = 4c_1$ and $c_4 = \frac{c_2}{8}$.
To prove the first part, note that

$$\text{Var}(X) \overset{(i)}{=} \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \overset{(ii)}{=} \int_0^\infty \mathbb{P}\big[(X - \mathbb{E}[X])^2 > t\big]\, dt \overset{(iii)}{\le} \int_0^\infty c_1 e^{-c_2 t}\, dt \overset{(iv)}{=} \frac{c_1}{c_2}$$

where (i) follows from definition, (ii) follows from the “integrating probabilities into expectations” trick (which I will describe shortly), (iii) follows from the provided bound, and (iv) follows from standard calculus. This proves the first claim.
This second part requires some clever insights to get this to work. One way to start is by noting that:
where the last inequality follows from the bound provided in the question. For us to be able to apply that bound, assume without loss of generality that the relevant difference is nonnegative, meaning that we can increase the probability by inserting absolute values. The above also shows that
We next tackle the core of the question. Starting from the left-hand side of the desired bound, we get
where step (i) follows from adding zero, step (ii) follows from the Triangle Inequality, and (iii) follows from the provided bound based on the expectation. And yes, this is supposed to work only when $\delta$ is large enough. The way to get around this is to assume $\delta$ is greater than some quantity. After some algebra, it turns out a nice condition for us to enforce is a suitable lower bound on $\delta$, which in turn makes the step above valid. If $\delta$ instead falls below that threshold, then the desired bound is attained because
a fact which can be derived through some algebra. Thus, the remainder of the proof boils down to checking that in the remaining case, we have
and this is proved by analyzing roots of the quadratic and solving for $\delta$.
Tricks used:
 Integrating Probabilities into Expectations
 Triangle Inequality
Comments: The trick “integrating probabilities into expectations” is one which I only recently learned about, though one can easily find it (along with the derivation) on the Wikipedia page for the expected value. In particular, note that for a nonnegative random variable $X$, we have

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X > t]\, dt$$

and in the above, I use this trick with the nonnegative random variable $(X - \mathbb{E}[X])^2$. It’s quite useful to convert between probabilities and expectations!
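For a nonnegative integer-valued variable, the identity becomes the tail sum $\mathbb{E}[X] = \sum_{t \ge 0} \mathbb{P}[X > t]$, which is easy to verify on a tiny hypothetical distribution of my own choosing:

```python
# "Integrating probabilities into expectations", discrete version:
# E[X] = sum over t >= 0 of P(X > t) for a nonnegative integer-valued X.
pmf = {0: 0.2, 1: 0.5, 3: 0.3}                    # hypothetical distribution
expectation = sum(v * p for v, p in pmf.items())   # direct computation of E[X]
tail_sum = sum(
    sum(p for v, p in pmf.items() if v > t)        # P(X > t)
    for t in range(max(pmf))                       # t = 0, 1, ..., max value - 1
)
print(expectation, tail_sum)                       # the two agree
```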
The other trick above is using the triangle inequality in a clever way. The key is to observe that when we have something like $\mathbb{P}[Z \ge t]$, if we replace $Z$ with something at least as large, then we can only increase that probability. This is another common trick used in proving various bounds.
Finally, the above also shows that when we have constants $c_1, c_2, c_3, c_4$, it pays to be clever in how we assign those values. Then the remainder is some brute-force computation. I suppose it also helps to think about inserting expectations whenever we have a probability and a median.
Example 4: Upper Bounds for $\ell_0$ “Balls”
Consider the set

$$T = \big\{\theta \in \mathbb{R}^d : \|\theta\|_0 \le s,\ \|\theta\|_2 \le 1\big\}.$$

We often write the number of nonzeros in $\theta$ as $\|\theta\|_0$ like this even though $\|\cdot\|_0$ is not technically a norm. This exercise consists of three parts:
(a) Show that $\mathcal{G}(T) = \mathbb{E}\big[\max_{S \in \bar{S}} \|w_S\|_2\big]$, where $\bar{S}$ consists of all subsets of $\{1, \ldots, d\}$ of size $s$, and $w_S$ is a subvector of $w$ (of size $s$) indexed by those components. Note that by this definition, the cardinality of $\bar{S}$ is equal to $\binom{d}{s}$.
(b) Show that for any fixed subset $S$ of cardinality $s$, we have $\mathbb{E}\big[\|w_S\|_2\big] \le \sqrt{s}$.
(c) Establish the claim that $\mathcal{G}(T) \lesssim \sqrt{s \log\big(\frac{ed}{s}\big)}$.
To be clear on the notation, $w \sim \mathcal{N}(0, I_d)$, and $\mathcal{G}(T) = \mathbb{E}\big[\sup_{\theta \in T} \langle \theta, w \rangle\big]$ refers to the Gaussian complexity of that set. It is, roughly speaking, a way to measure the “size” of a set.
To prove (a), let $\theta \in T$ and let $S$ indicate the support of $\theta$ (i.e. where its nonzeros occur). For any $w \in \mathbb{R}^d$ (which we later treat as sampled from $\mathcal{N}(0, I_d)$, though the immediate analysis below does not require that fact) we have

$$\langle \theta, w \rangle = \langle \theta_S, w_S \rangle \le \|\theta_S\|_2 \|w_S\|_2 \le \|w_S\|_2$$

where $w_S$ refers to the vector taking only the components of $w$ indexed by the nonzeros of $\theta$. The first inequality follows from Cauchy-Schwarz. In addition, by standard norm properties, taking $\theta_S = \frac{w_S}{\|w_S\|_2}$ results in the case when equality is attained. The claim thus follows. (There are some technical details needed regarding which of the maximums — over the set sizes or over the vector selection — should come first, but I don’t think the details are critical for me to know.)
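The equality-attaining choice in the Cauchy-Schwarz step can be illustrated numerically; the dimension and the support set below are arbitrary choices for the sketch:

```python
import numpy as np

# Cauchy-Schwarz with the equality-attaining direction: restricting theta to
# the support S and setting theta_S = w_S / ||w_S||_2 (a unit vector) gives
# <theta, w> = ||w_S||_2 exactly.
rng = np.random.default_rng(4)
d, S = 8, [1, 4, 6]                      # hypothetical dimension and support
w = rng.normal(size=d)
theta = np.zeros(d)
theta[S] = w[S] / np.linalg.norm(w[S])   # supported on S, unit Euclidean norm
inner = float(theta @ w)                 # the inner product <theta, w>
target = float(np.linalg.norm(w[S]))     # the Cauchy-Schwarz upper bound
print(inner, target)                     # equal up to floating point
```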
For (b), we first claim that the function $f_S(w) := \|w_S\|_2$ is Lipschitz with respect to the Euclidean norm with Lipschitz constant $L = 1$. To see this, observe that when $w$ and $w'$ are both $d$-dimensional vectors, we have

$$\big|f_S(w) - f_S(w')\big| = \big|\|w_S\|_2 - \|w'_S\|_2\big| \overset{(i)}{\le} \|w_S - w'_S\|_2 \overset{(ii)}{\le} \|w - w'\|_2$$

where (i) follows from the reverse triangle inequality for normed spaces and (ii) follows from how the vector $w_S - w'_S$ cannot have more nonzero terms than $w - w'$ but must otherwise match it for indices lying in the subset $S$.
The fact that $f_S$ is Lipschitz means that we can apply a theorem regarding tail bounds of Lipschitz functions of Gaussian variables. The function itself doesn’t require its input to consist of vectors with IID standard Gaussian components, but we have to assume that the input is like that for the purposes of the theorem/bound to follow. More formally, for all $\delta \ge 0$ we have

$$\mathbb{P}\big[\|w_S\|_2 \ge \sqrt{s} + \delta\big] \overset{(i)}{\le} \mathbb{P}\big[\|w_S\|_2 \ge \mathbb{E}[\|w_S\|_2] + \delta\big] \overset{(ii)}{\le} e^{-\frac{\delta^2}{2}}$$

where (i) follows from how $\mathbb{E}\big[\|w_S\|_2\big] \le \sqrt{s}$ and thus we are just decreasing the threshold for the event (hence making it more likely) and (ii) follows from the theorem, which provides an $L^2$ in the denominator of the exponential, but $L = 1$ here.
Finally, to prove (c), we first note that the previous part’s theorem guarantees that each function $f_S(w) = \|w_S\|_2$ is sub-Gaussian with parameter $\sigma = 1$. Using this, we have

$$\mathcal{G}(T) = \mathbb{E}\Big[\max_{S \in \bar{S}} \|w_S\|_2\Big] \overset{(i)}{\le} \sqrt{s} + \sqrt{2 \log \binom{d}{s}} \overset{(ii)}{\lesssim} \sqrt{s \log\Big(\frac{ed}{s}\Big)}$$

where (i) applies the bound for a maximum over sub-Gaussian random variables for all the $\binom{d}{s}$ sets (see Example 1 earlier), each with parameter $\sigma = 1$, and (ii) applies an approximate bound due to Stirling’s approximation and ignores the constants. The careful reader will note that Example 1 required zero-mean sub-Gaussian random variables, but we can generally get around this by, I believe, subtracting away a mean and then re-adding it later.
Tricks used:
 Cauchy-Schwarz
 Jensen’s Inequality
 Lipschitz Functions
 Norm Properties
 Stirling’s Approximation
 Triangle Inequality
Comments: This exercise involves a number of tricks. The fact that $\mathbb{E}\big[\|w_S\|_2\big] \le \sqrt{s}$ follows from how

$$\mathbb{E}\big[\|w_S\|_2\big] = \mathbb{E}\Big[\sqrt{\|w_S\|_2^2}\Big] \le \sqrt{\mathbb{E}\big[\|w_S\|_2^2\big]} = \sqrt{s}$$

due to Jensen’s inequality and how $\mathbb{E}[w_i^2] = 1$ for standard Gaussians $w_i$. Fiddling with norms, expectations, and square roots is another common way to utilize Jensen’s inequality (in addition to using Jensen’s inequality with the exponential function, as explained earlier). Moreover, if you see norms in a probabilistic bound statement, you should immediately be thinking of the possibility of using a theorem related to Lipschitz functions.
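This use of Jensen with the concave square root can be checked by simulation (the dimension and sample size below are arbitrary):

```python
import numpy as np

# Jensen with the concave square root: for w ~ N(0, I_d),
# E[||w||_2] = E[sqrt(||w||_2^2)] <= sqrt(E[||w||_2^2]) = sqrt(d).
rng = np.random.default_rng(5)
d, trials = 16, 100000
w = rng.normal(size=(trials, d))
norms = np.linalg.norm(w, axis=1)
lhs = norms.mean()           # Monte Carlo estimate of E[||w||_2]
rhs = np.sqrt(d)             # sqrt(E[||w||_2^2]), computed exactly
print(lhs, rhs)              # the estimate falls just below sqrt(d)
```

The gap is small but real: the expected norm of a standard Gaussian vector is slightly less than $\sqrt{d}$.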
The example also uses the (reverse!) triangle inequality for norms:

$$\big|\, \|a\| - \|b\| \,\big| \le \|a - b\|.$$

This can come up quite often and is the non-canonical way of viewing the triangle inequality, so watch out!
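A randomized check of the reverse triangle inequality for the Euclidean norm (dimensions and sample count are arbitrary):

```python
import numpy as np

# Reverse triangle inequality: | ||a|| - ||b|| | <= ||a - b||,
# checked on many random vector pairs.
rng = np.random.default_rng(6)
a = rng.normal(size=(1000, 5))
b = rng.normal(size=(1000, 5))
lhs = np.abs(np.linalg.norm(a, axis=1) - np.linalg.norm(b, axis=1))
rhs = np.linalg.norm(a - b, axis=1)
holds = bool(np.all(lhs <= rhs + 1e-12))   # tiny tolerance for float error
print(holds)
```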
Finally, don’t forget the trick where we have $\log \binom{d}{s} \le s \log\big(\frac{ed}{s}\big)$. This comes from an application of Stirling’s approximation and is seen frequently in cases involving sparsity, where $s$ components are “selected” out of $d$ total. The maximum over a finite set should also provide a big hint regarding the use of a sub-Gaussian bound over maximums of (sub-Gaussian) variables.
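The binomial-coefficient bound behind this (equivalently $\binom{d}{s} \le (ed/s)^s$) can be verified exhaustively for small cases:

```python
from math import comb, e

# Stirling-style counting bound: C(d, s) <= (e*d/s)^s for 1 <= s <= d,
# equivalently log C(d, s) <= s * (1 + log(d/s)).
checks = []
for d in (10, 50, 200):
    for s in range(1, d + 1):
        checks.append(comb(d, s) <= (e * d / s) ** s)
all_ok = all(checks)
print(all_ok)
```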
Example 5: Gaussian Complexity of Ellipsoids
Recall that the space $\ell^2(\mathbb{N})$ consists of all real sequences $(\theta_j)_{j \ge 1}$ such that $\sum_{j=1}^\infty \theta_j^2 < \infty$. Given a strictly positive sequence $(\mu_j)_{j \ge 1}$, consider the associated ellipse

$$\mathcal{E} = \Big\{(\theta_j)_{j \ge 1} : \sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \le 1\Big\}.$$
(a) Prove that the Gaussian complexity satisfies the bounds

$$\sqrt{\frac{2}{\pi}}\Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2} \le \mathcal{G}(\mathcal{E}) \le \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2}.$$
(b) For a given radius $R > 0$, consider the truncated set

$$\tilde{\mathcal{E}} = \mathcal{E} \cap \Big\{(\theta_j)_{j \ge 1} : \sum_{j=1}^\infty \theta_j^2 \le R^2\Big\}.$$
Obtain upper and lower bounds on its Gaussian complexity that are tight up to universal constants independent of and .
To prove (a), we first start with the upper bound. Letting $(w_j)_{j \ge 1}$ indicate a sequence of IID standard Gaussians $w_j \sim \mathcal{N}(0, 1)$, we have

$$\mathcal{G}(\mathcal{E}) \overset{(i)}{=} \mathbb{E}\Big[\sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \theta_j w_j\Big] \overset{(ii)}{=} \mathbb{E}\Big[\sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \frac{\theta_j}{\mu_j}\, \mu_j w_j\Big] \overset{(iii)}{\le} \mathbb{E}\Big[\sup_{\theta \in \mathcal{E}} \Big(\sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2}\Big)^{1/2} \Big(\sum_{j=1}^\infty \mu_j^2 w_j^2\Big)^{1/2}\Big] \overset{(iv)}{\le} \mathbb{E}\Big[\Big(\sum_{j=1}^\infty \mu_j^2 w_j^2\Big)^{1/2}\Big] \overset{(v)}{\le} \Big(\sum_{j=1}^\infty \mu_j^2\, \mathbb{E}[w_j^2]\Big)^{1/2} \overset{(vi)}{=} \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2}$$

where (i) follows from definition, (ii) follows from multiplying by one, (iii) follows from a clever application of the Cauchy-Schwarz inequality for sequences (or more generally, Hölder’s Inequality), (iv) follows from the definition of $\mathcal{E}$, (v) follows from Jensen’s inequality along with linearity of expectation, and (vi) follows from how $\mathbb{E}[w_j^2] = 1$.
We next prove the lower bound. First, we note the well-known result that $\mathcal{G}(\mathcal{E}) \ge \sqrt{\frac{2}{\pi}}\, \mathcal{R}(\mathcal{E})$, where $\mathcal{R}(\mathcal{E})$ indicates the Rademacher complexity of the set. Thus, our task now boils down to showing that $\mathcal{R}(\mathcal{E}) = \big(\sum_{j=1}^\infty \mu_j^2\big)^{1/2}$. Letting $(\varepsilon_j)_{j \ge 1}$ be IID Rademachers, we first begin by proving the upper bound

$$\mathcal{R}(\mathcal{E}) \overset{(i)}{=} \mathbb{E}\Big[\sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \theta_j \varepsilon_j\Big] \overset{(ii)}{=} \sup_{\theta \in \mathcal{E}} \sum_{j=1}^\infty \frac{\theta_j}{\mu_j}\, \mu_j \overset{(iii)}{\le} \sup_{\theta \in \mathcal{E}} \Big(\sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2}\Big)^{1/2} \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2} \overset{(iv)}{\le} \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2}$$

where (i) follows from definition, (ii) follows from the symmetric nature of the class (meaning that WLOG we can pick $\varepsilon_j = 1$ for all $j$) and then multiplying by one, (iii) follows from Cauchy-Schwarz again, and (iv) follows from the provided bound in the definition of $\mathcal{E}$.
We’re not done yet: we actually need to show equality for this, or at the very least prove a lower bound instead of an upper bound. However, if one chooses the valid sequence $(\theta_j)$ with $\theta_j = \frac{\mu_j^2}{\left(\sum_{k=1}^\infty \mu_k^2\right)^{1/2}}$, then equality is attained, since we get

$$\sum_{j=1}^\infty \theta_j = \frac{\sum_{j=1}^\infty \mu_j^2}{\big(\sum_{k=1}^\infty \mu_k^2\big)^{1/2}} = \Big(\sum_{j=1}^\infty \mu_j^2\Big)^{1/2}$$

in one of our steps above. This proves part (a).
For part (b), we construct two ellipses, one that contains $\tilde{\mathcal{E}}$ and one which is contained inside it. Let $\tilde{\mu}_j = \min\{\mu_j, R\}$. Then we claim that the ellipse $\mathcal{E}_1$ defined out of this sequence (i.e. treating “$\tilde{\mu}_j$” as our “$\mu_j$”) will be contained in $\tilde{\mathcal{E}}$. We moreover claim that the ellipse $\mathcal{E}_2$ defined out of the sequence $\sqrt{2}\,\tilde{\mu}_j$ for all $j$ contains $\tilde{\mathcal{E}}$, i.e. $\mathcal{E}_1 \subseteq \tilde{\mathcal{E}} \subseteq \mathcal{E}_2$. If this is true, it then follows that

$$\mathcal{G}(\mathcal{E}_1) \le \mathcal{G}(\tilde{\mathcal{E}}) \le \mathcal{G}(\mathcal{E}_2)$$

because the definition of Gaussian complexity requires taking a supremum over a set, and if the set grows larger via set containment, then the Gaussian complexity can only grow larger. In addition, the fact that the upper and lower bounds are related by a constant suggests that there should be extra lower and upper bounds utilizing universal constants independent of $R$ and $(\mu_j)$.
Let us prove the two set inclusions previously described, as well as develop the desired upper and lower bounds. Suppose $\theta \in \mathcal{E}_1$. Then we have

$$\sum_{j=1}^\infty \frac{\theta_j^2}{\mu_j^2} \le \sum_{j=1}^\infty \frac{\theta_j^2}{\tilde{\mu}_j^2} \le 1$$

and

$$\sum_{j=1}^\infty \frac{\theta_j^2}{R^2} \le \sum_{j=1}^\infty \frac{\theta_j^2}{\tilde{\mu}_j^2} \le 1.$$

In both cases, the first inequality is because we can only decrease the value in the denominator.^{2} The last inequality follows by assumption of membership in $\mathcal{E}_1$. Both requirements for membership in $\tilde{\mathcal{E}}$ are satisfied, and therefore, $\theta \in \mathcal{E}_1$ implies $\theta \in \tilde{\mathcal{E}}$, and thus the first set containment holds. Moving on to the second set containment, suppose $\theta \in \tilde{\mathcal{E}}$. We have

$$\sum_{j=1}^\infty \frac{\theta_j^2}{2\tilde{\mu}_j^2} \overset{(i)}{\le} \frac{1}{2} \sum_{j=1}^\infty \Big(\frac{\theta_j^2}{\mu_j^2} + \frac{\theta_j^2}{R^2}\Big) \overset{(ii)}{\le} 1$$
where (i) follows from a “union bound”-style argument, which, to be clear, happens because for every term in the summation, we have either $\frac{\theta_j^2}{\mu_j^2}$ or $\frac{\theta_j^2}{R^2}$ added to the summation (both nonnegative quantities). Thus, to make the value larger, just add both terms! Step (ii) follows from the assumption of membership in $\tilde{\mathcal{E}}$. Thus, we conclude that $\theta \in \mathcal{E}_2$, and we have proved that

$$\mathcal{E}_1 \subseteq \tilde{\mathcal{E}} \subseteq \mathcal{E}_2.$$
The final step of this exercise is to develop a lower bound on the left-hand side and an upper bound on the right-hand side that are close up to universal constants. But we have reduced this to an instance of part (a)! Thus, we simply apply the lower bound for $\mathcal{G}(\mathcal{E}_1)$ and the upper bound for $\mathcal{G}(\mathcal{E}_2)$ and obtain

$$\sqrt{\frac{2}{\pi}}\Big(\sum_{j=1}^\infty \min\{\mu_j, R\}^2\Big)^{1/2} \le \mathcal{G}(\tilde{\mathcal{E}}) \le \sqrt{2}\Big(\sum_{j=1}^\infty \min\{\mu_j, R\}^2\Big)^{1/2}$$

as our final bounds on $\mathcal{G}(\tilde{\mathcal{E}})$. (Note that as a sanity check, the constant offset is less than one.) This proves part (b).
Tricks used:
 Cauchy-Schwarz
 Jensen’s Inequality
 Union Bound
Comments: This exercise on the surface looks extremely challenging. How does one reason about multiple infinite sequences, which furthermore may or may not involve squared terms? I believe the key to tackling these problems is to understand how to apply Cauchy-Schwarz (or more generally, Hölder’s Inequality) for infinite sequences. More precisely, Hölder’s Inequality for sequence spaces states that

$$\sum_{j=1}^\infty |a_j b_j| \le \Big(\sum_{j=1}^\infty a_j^2\Big)^{1/2} \Big(\sum_{j=1}^\infty b_j^2\Big)^{1/2}.$$

(It’s actually more general than this, since we can assume arbitrary positive powers $p$ and $q$ so long as $\frac{1}{p} + \frac{1}{q} = 1$, but the easiest case to understand is when $p = q = 2$.)
Hölder’s Inequality is enormously helpful when dealing with sums (whether infinite or not), and especially when dealing with two sums where one does not square its terms but the other one does.
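A randomized check of the $p = q = 2$ case (i.e. Cauchy-Schwarz) on finite truncations of sequences; the length and seed are arbitrary:

```python
import numpy as np

# Cauchy-Schwarz for sequences (Hölder with p = q = 2):
# sum |a_j b_j| <= (sum a_j^2)^(1/2) * (sum b_j^2)^(1/2).
rng = np.random.default_rng(7)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
lhs = np.abs(a * b).sum()
rhs = np.sqrt((a**2).sum()) * np.sqrt((b**2).sum())
print(lhs, rhs)   # lhs stays below rhs (equality only when a, b are proportional)
```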
Finally, again, think about Jensen’s inequality whenever we have expectations and a square root!
Example 6: Pairwise Incoherence
Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose it has normalized columns ($\frac{\|X_j\|_2}{\sqrt{n}} = 1$ for all $j$) and pairwise incoherence upper bounded as $\delta_{PW}(X) \le \frac{c}{s}$ for a sufficiently small constant $c > 0$.
(a) Let be any subset of size . Show that there is a function such that as long as is sufficiently small, where is the matrix formed by extracting the columns of whose indices are in .
(b) Prove, from first principles, that satisfies the restricted nullspace property with respect to as long as .
To clarify, the pairwise incoherence of a matrix $X \in \mathbb{R}^{n \times d}$ is defined as

$$\delta_{PW}(X) := \max_{j, k \in \{1, \ldots, d\}} \left|\frac{\langle X_j, X_k \rangle}{n} - \mathbb{1}\{j = k\}\right|$$

where $X_j$ denotes the $j$th column of $X$. Intuitively, it measures the correlation between any two columns, though it subtracts an indicator at the end so that the maximal case does not always correspond to the case when $j = k$. In addition, the matrix $\frac{X_S^T X_S}{n}$ as defined in the problem has 1s on the diagonal (due to the assumption of having normalized columns) and off-diagonal entries bounded in absolute value by $\delta_{PW}(X)$.
First, we prove part (a). Starting from the variational representation of the minimum eigenvalue, we consider any possible $\theta$ with Euclidean norm one (and thus this analysis will apply to the minimizer which induces the minimum eigenvalue) and observe that
where (i) follows from the definition of a quadratic form (less formally, by matrix multiplication), (ii) follows from the assumption, (iii) follows from noting that
which in turn follows from the pairwise incoherence assumption. Step (iv) follows from definition, and (v) follows from how $\|v\|_1 \le \sqrt{s}\,\|v\|_2$ for $s$-dimensional vectors $v$.
The above applies for any satisfactory . Putting together the pieces, we conclude that
which follows if $c$ is sufficiently small.
To prove the restricted nullspace property in (b), we first suppose that $\theta \in \ker(X)$ and $\theta \ne 0$. Define $d$-dimensional vectors $\theta_S$ and $\theta_{S^c}$, which match the components of $\theta$ for the indices within their respective sets $S$ or $S^c$, and which are zero otherwise.^{3} Supposing that $S$ corresponds to the subset of indices of $\theta$ with the $s$ largest elements in absolute value, it suffices to show that $\|\theta_S\|_1 < \|\theta_{S^c}\|_1$, because then we can never violate this inequality (and thus the restricted nullspace property holds).
We first show a few facts which we then piece together to get the final result. The first is that
where (i) follows from the assumption that $\theta$ is in the kernel of $X$, (ii) follows from how $\theta = \theta_S + \theta_{S^c}$, (iii) follows from expanding the term, and (iv) follows from carefully noting that
where in the inequality, we have simply chosen $\theta_S$ as our vector, which can only make the bound worse. Then step (iv) follows immediately. Don’t forget that the norm of $\theta_S$ is the same whether we view it as an $s$-dimensional or a $d$-dimensional vector, because the latter (while longer) only has extra zeros. Incidentally, the above uses the variational representation for eigenvalues in a way that’s more convenient if we don’t want to restrict our vectors to have Euclidean norm one.
We conclude from the above that
Next, let us upper bound the RHS. We see that
where (i) follows from a little thought about how matrix multiplication and quadratic forms work. In particular, if we expanded out the LHS, we would get a sum with lots of terms that are zero, since $\theta_S$ and $\theta_{S^c}$ have disjoint supports. Step (ii) follows from definition, step (iii) follows from the provided pairwise incoherence bound (note the need to multiply by $n$), and step (iv) follows from how

$$\|\theta_S\|_1 \|\theta_{S^c}\|_1 = \Big(\sum_{i \in S} |\theta_i|\Big)\Big(\sum_{j \in S^c} |\theta_j|\Big)$$

and thus it is clear that the product of the norms consists of the sum of all possible combinations of indices with nonzero values.
The last thing we note is that, from part (a), if we assume the pairwise incoherence is small enough, then we get a lower bound on the minimum eigenvalue from earlier. Putting the pieces together, we get the following three inequalities
We can provide a lower bound for the first term above. Using the fact that $\|\theta_S\|_1 \le \sqrt{s}\,\|\theta_S\|_2$, we get $\|\theta_S\|_2^2 \ge \frac{\|\theta_S\|_1^2}{s}$. The final step is to tie the lower bound here with the upper bound from the set of three inequalities above. This results in
Under the same assumption as earlier, it follows directly that $\|\theta_S\|_1 < \|\theta_{S^c}\|_1$, as claimed. Whew!
Tricks used:
 Cauchy-Schwarz
 Norm Properties
 Variational Representation (of eigenvalues)
Comments: Actually, for part (a), one can prove the claim more directly by using the Gershgorin Circle Theorem, a very useful theorem with a surprisingly simple proof. But I chose the route above so that we could make use of the variational representation for eigenvalues. There are also variational representations for singular values.
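To make the Gershgorin route concrete: every eigenvalue of a matrix $A$ lies in some disc centered at a diagonal entry $A_{ii}$ with radius equal to the sum of the absolute off-diagonal entries in row $i$. For a symmetric matrix with unit diagonal and off-diagonals of magnitude at most $\delta$ (like $\frac{X_S^T X_S}{n}$ under small pairwise incoherence), all eigenvalues land in $[1 - (s-1)\delta,\ 1 + (s-1)\delta]$. A sketch with an arbitrarily generated matrix:

```python
import numpy as np

# Gershgorin check: a symmetric s x s matrix with unit diagonal and
# off-diagonal entries in (-delta, delta) has all eigenvalues within
# (s-1)*delta of 1.
rng = np.random.default_rng(8)
s, delta = 5, 0.05
off = rng.uniform(-delta, delta, size=(s, s))
A = np.eye(s) + np.triu(off, 1) + np.triu(off, 1).T   # unit diagonal, small "incoherence"
eigs = np.linalg.eigvalsh(A)
radius = (s - 1) * delta                              # Gershgorin disc radius
print(eigs.min(), eigs.max())                         # within [1 - radius, 1 + radius]
```

This is exactly why small pairwise incoherence keeps the minimum eigenvalue bounded away from zero.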
The above uses a lot of norm properties. One example was the use of $\|v\|_1 \le \sqrt{d}\,\|v\|_2$, which can be proved via Cauchy-Schwarz. The extension to this is that $\|v\|_2 \le \sqrt{d}\,\|v\|_\infty$. These are quite handy. Another example, which is useful when dealing with specific subsets of indices, is to understand how the $\ell_1$ and $\ell_2$ norms behave. Admittedly, getting all the steps right for part (b) takes a lot of hassle and attention to detail, but it is certainly satisfying to see it work.
Closing Thoughts
I hope this post serves as a useful reference for me and to anyone else who might need to use one of these tricks to understand some machine learning and statistics-related math.

One of my undergraduate mathematics professors, Steven J. Miller, would love this trick, as his two favorite tricks in mathematics are adding zero (along with, of course, multiplying by one). ↩

Or “downstairs” as professor Michael I. Jordan often puts it (and obviously, “upstairs” for the numerator). ↩

It can take some time and effort to visualize and process all this information. I find it helpful to draw some of these out with pencil and paper, and also to assume without loss of generality that $S$ corresponds to the first “block” of $\theta$, and therefore $S^c$ corresponds to the second (and last) “block.” Please contact me if you spot typos; they’re really easy to make here. ↩