Part 4: Convergence of random variables and limit theorems
Probability Theory: Math 605, Fall 2024
Luc Rey-Bellet
University of Massachusetts Amherst
2024-12-09
1 Convergence of random variables
We think of a (real-valued) random variable as a function X(\omega) and so if we have a sequence of RVs \{ X_n\} we can define various types of convergence.
Convergence almost sure
Convergence in probability
Convergence in L^p
In subsequent chapters we will study another type of convergence, namely weak convergence (also called convergence in distribution). It is of a quite different type because it is based on the distribution of X and not on the notion of a random variable as a function.
1.1 Almost sure convergence
Definition 1.1 (Almost sure convergence) A sequence of RVs \{X_n\} converges almost surely to a RV X if
\lim_{n} X_n(\omega) = X(\omega) \text{ a.s}
that is P(\{\omega\,:\, \lim_{n} X_n(\omega) = X(\omega)\})=1.
Almost sure convergence is pointwise convergence; the limit is unique if, of course, we identify RVs which are equal a.s.
It will be very useful to rephrase almost sure convergence in a different way. At first sight it looks a bit strange but it is a good idea. We explain the idea first for a sequence \{x_n\} of numbers. Consider the function defined for any \epsilon>0
i_\epsilon(x) = 1_{(\epsilon,\infty)}(x) = \left\{
\begin{array}{cl}
1 & x > \epsilon \\
0 & x \le \epsilon
\end{array}
\right.
Lemma 1.1
A sequence \{x_n\} converges to x if and only if \sum_n i_\epsilon(|x_n-x|) < \infty for every \epsilon>0.
If there is a non-negative sequence \epsilon_n such that \sum_{n}\epsilon_n <\infty and \sum_{n} i_{\epsilon_n}(|x_n-x_{n+1}|) < \infty then x_n converges to some x.
Proof.
Fix \epsilon>0. If x_n \to x then there exists N such that for any n \ge N, |x_n-x|\le \epsilon which means i_\epsilon(|x_n-x|) =0 for n \ge N and thus \sum_n i_\epsilon(|x_n-x|) < \infty. Conversely if \sum_n i_\epsilon(|x_n-x|) < \infty then, since the terms are either 0 or 1, only finitely many terms can be nonzero and thus there exists N such that |x_n-x| \le \epsilon for n \ge N. \quad \square
For the second statement, if \sum_{n} i_{\epsilon_n}(|x_n-x_{n+1}|) < \infty then there exists N such that |x_n-x_{n+1}|\le \epsilon_n for n \ge N. Taking n>m \ge N gives
|x_n-x_m| \le |x_n-x_{n-1}|+ \cdots + |x_{m+1}-x_m| \le \epsilon_m + \cdots + \epsilon_n \le \sum_{j=m}^\infty \epsilon_j
Since the sequence \epsilon_n is summable, \sum_{j=m}^\infty \epsilon_j goes to 0 as m \to \infty. Therefore the sequence x_n is a Cauchy sequence and thus converges to some x. \quad \square
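To make the criterion concrete, here is a small numerical check (a sketch, not part of the notes; the sequence x_n = 1/n and the tolerances are arbitrary choices): for each \epsilon only finitely many terms exceed \epsilon.
Code
def i_eps(eps, x):
    # the indicator function 1_{(eps, infty)}(x)
    return 1 if x > eps else 0

x_seq = [1 / n for n in range(1, 10001)]   # x_n = 1/n, which converges to 0
for eps in [0.1, 0.01, 0.001]:
    total = sum(i_eps(eps, abs(xn - 0)) for xn in x_seq)
    print(f"eps = {eps}: number of n with |x_n - 0| > eps is {total}")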
Returning to random variables we find
Theorem 1.1 The sequence of RVs X_n converges almost surely to X if and only if, for every \epsilon>0,
\sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \quad \text{ almost surely}
\tag{1.1}
Proof.
Let \Omega_0 be the set on which convergence holds; then the sum in Equation 1.1 converges for every \omega \in \Omega_0 (which is independent of \epsilon).
For the converse, the only small issue to deal with is that the set on which the sum converges may depend on \epsilon. So pick a sequence \epsilon_k \searrow 0 and let N_k be the value of the sum in Equation 1.1 with \epsilon=\epsilon_k; by assumption P(N_k < \infty)=1. Since \epsilon_{k+1}\le \epsilon_k we have i_{\epsilon_{k+1}}\ge i_{\epsilon_{k}} and so N_{k+1}\ge N_k. The events \{N_k < \infty\} shrink to
\Omega_0=\{ \omega\,:\, \sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \text{ for all } \epsilon >0 \}
By sequential continuity P(\Omega_0)=\lim_{k} P(N_k < \infty)=1 and thus X_n converges to X almost surely. \quad \square.
At this point we recall the Borel-Cantelli Lemma
Lemma 1.2 (Borel Cantelli Lemma) Suppose B_n is a collection of events. Then
\sum_n P(B_n) < \infty \implies \sum_{n} 1_{B_n} < \infty \text{ almost surely}\,.
which we use to prove the following criteria for almost sure convergence
Theorem 1.2
If, for any \epsilon> 0, \sum_{n} P(|X_n-X|\ge \epsilon) < \infty then X_n \to X almost surely.
If there exists a sequence \epsilon_n \searrow 0 such that \sum_{n} P(|X_n-X|\ge \epsilon_n) < \infty, then X_n \to X almost surely.
If there exists a sequence \epsilon_n with \sum_{n}\epsilon_n <\infty and \sum_{n} P(|X_n-X_{n+1}|\ge \epsilon_n) < \infty, then X_n \to X almost surely.
Proof.
By Borel-Cantelli we have \sum_{n} i_\epsilon(|X_n-X|) < \infty a.s. for every \epsilon>0, which by Theorem 1.1 means a.s. convergence.
By the Borel-Cantelli Lemma we have |X_n-X| \le \epsilon_n for all but finitely many n, almost surely. Since \epsilon_n \to 0 this means that X_n converges to X a.s.
By Lemma 1.1 and Borel-Cantelli \{X_n\} is a Cauchy sequence a.s. and thus converges a.s.
1.2 Convergence in Probability
Definition 1.2 (Convergence in probability) A sequence of RVs \{X_n\} converges in probability to a RV X if, for any \epsilon> 0,
\lim_{n \to \infty} P( \{\omega\,:\, |X_n(\omega) - X(\omega)|> \epsilon\}) =0 \,.
This is a very useful mode of convergence, in particular because it is weaker than almost sure convergence and thus easier to establish.
Example: Let \Omega=[0,1) and P_0 be Lebesgue measure. Consider the sequence of RV
X_1 = 1_{[0,\frac12)},\ X_2 = 1_{[\frac12,1)},\ X_3 = 1_{[0,\frac13)},\ X_4 = 1_{[\frac13,\frac23)},\ X_5 = 1_{[\frac23,1)},\ X_6 = 1_{[0,\frac14)},\ X_7 = 1_{[\frac14,\frac24)}, \cdots
\tag{1.2}
We claim that X_n converges to 0 in probability. Indeed for any \epsilon>0, P( |X_n| > \epsilon ) =P(X_n=1) \to 0 since X_n=1_{I_n} is the indicator function of an interval I_n whose measure goes to 0 as n goes to infinity.
The sequence X_n does not converge a.s. Indeed every \omega \in [0,1) belongs to infinitely many of the intervals I_n and also lies outside infinitely many of them. Therefore \liminf_n X_n(\omega)=0 \not= \limsup_n X_n(\omega)= 1.
The sequence X_n has (many!) subsequences which converge to 0 almost surely. To construct one, choose n_k such that the interval I_{n_k} with X_{n_k}=1_{I_{n_k}} is contained in the interval I_{n_{k-1}} with X_{n_{k-1}}= 1_{I_{n_{k-1}}}.
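A short simulation of the sequence in Equation 1.2 illustrates both claims (a sketch; the enumeration of the intervals and the sample point \omega=0.7 are reconstructed here only for illustration): the length of I_n, which equals P(X_n=1), shrinks to 0, while any fixed \omega keeps landing in infinitely many of the intervals.
Code
import itertools

def intervals():
    # enumerate the intervals I_1, I_2, ... of Equation 1.2: [k/m, (k+1)/m) for m = 2, 3, ...
    for m in itertools.count(2):
        for k in range(m):
            yield (k / m, (k + 1) / m)

omega = 0.7                                   # a fixed sample point
hits = 0
for a, b in itertools.islice(intervals(), 1000):
    if a <= omega < b:                        # X_n(omega) = 1_{I_n}(omega)
        hits += 1

print("length of the last interval (= P(X_n = 1)):", b - a)   # tends to 0: convergence in probability
print("number of n <= 1000 with X_n(omega) = 1:", hits)        # keeps growing: no pointwise convergence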
The relation between almost sure convergence and convergence in probability is contained in the following theorem. The third part, while looking somewhat convoluted, is very useful to streamline subsequent proofs. It relies on the following simple fact: if a sequence x_n is such that every subsequence has a subsubsequence which converges to x, then x_n converges to x.
Theorem 1.3 (Almost sure convergence versus convergence in probability)
If X_n converges almost surely to X then X_n converges in probability to X.
If X_n converges in probability to X then there exists a subsequence X_{n_k} which converges to X almost surely.
If every subsequence has a further subsubsequence which converges to X almost surely, then X_n converges to X in probability.
Proof. Item 1.: If X_n converges to X almost surely then i_\epsilon(|X_n-X|) converges to 0 almost surely. By the bounded convergence theorem this implies E[i_\epsilon(|X_n-X|)]=P(|X_n-X|\ge \epsilon) converges to 0.
Item 2.: If X_n converges to X in probability then pick \epsilon_k =\frac{1}{k} \searrow 0. Since P(|X_n(\omega) - X(\omega)|> \epsilon_k) \to 0 as n \to \infty we can find a subsequence n_k such that P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \frac{1}{2^k} and thus
\sum_{k=1}^\infty P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \sum_{k} \frac{1}{2^k} < \infty \,.
By part 2. of Theorem 1.2, X_{n_k} converges almost surely to X.
Item 3.: Assume that every subsequence of X_n has a subsubsequence which converges to X almost surely. Fix \epsilon>0 and consider the numerical sequence p_n(\epsilon)=P(|X_n-X| \ge \epsilon). Since this sequence is bounded, by the Bolzano-Weierstrass theorem we may pick a convergent subsequence p_{n_k} with \lim_{k} p_{n_k} =p. By assumption X_{n_k} has a subsubsequence X_{n_{k_j}} which converges to X almost surely. This implies, by part 1., that p_{n_{k_j}} converges to 0 and hence p=0. Thus every convergent subsequence of the bounded sequence p_n converges to 0, so p_n converges to 0 and X_n converges to X in probability. \quad \square
Based on this we obtain the following continuity theorem
Theorem 1.4 (Continuity theorem for convergence in probability)
If X_n converges to X almost surely and f is a continuous function then f(X_n) converges to f(X) almost surely.
If X_n converges to X in probability and f is a continuous function then f(X_n) converges to f(X) in probability.
Proof. Part 1. is just the definition of continuity.
For part 2. suppose X_n converges to X in probability. Then, by Theorem 1.3, there exists a subsequence X_{n_k} which converges to X almost surely, which implies, by part 1., that f(X_{n_k}) converges to f(X) almost surely.
Now we apply part 3. of Theorem 1.3 to the sequence Y_n= f(X_n). By the argument of the previous paragraph, every subsequence of f(X_n) has a subsubsequence which converges to f(X) a.s., and thus f(X_n) converges to f(X) in probability. \quad \square
Using a similar argument we show that convergence in probability is preserved under arithmetic operations.
Theorem 1.5 Suppose X_n converges to X in probability and Y_n converges to Y in probability. Then X_n+Y_n converges to X+Y in probability, X_n-Y_n converges to X-Y in probability, X_n Y_n converges to XY in probability and X_n/Y_n converges to X/Y in probability (assuming that Y_n and Y are almost surely non-zero).
Proof. All the proofs are the same so let us do the sum. Given any subsequence of X_n+Y_n, pick a further subsequence along which X_n converges to X almost surely, and then a further subsubsequence along which both X_n and Y_n converge almost surely; along it X_n+Y_n converges to X+Y almost surely. Thus every subsequence of X_n+Y_n has a subsubsequence which converges to X+Y almost surely, and by part 3. of Theorem 1.3, X_n+Y_n converges to X+Y in probability. \quad \square
We finish this section by showing that we can use convergence in probability to turn the space of RVs into a complete metric space. We will use the following metric
d(X,Y) = E[ \min\{ |X-Y|, 1\} ]
The choice is not unique; one often finds instead d(X,Y)=E\left[ \frac{|X-Y|}{1 + |X-Y|}\right].
Theorem 1.6 (Convergence in probability and metric)
X_n converges to X in probability if and only if \lim_{n \to \infty}d(X_n,X) =0.
The space of all measurable RV,
L^0(\Omega, \mathcal{A}, P)=\{ X :(\Omega, \mathcal{A}, P) \to (\mathbb{R},\mathcal{B}) \text{ measurable} \}
with the metric d is a complete metric space for the metric d(X,Y) (as usual we identify RV which are a.s. equal).
Equivalently, for any Cauchy sequence \{X_n\} for convergence in probability there exists a random variable Y such that X_n converges to Y in probability.
Proof. It is easy to check that d(X,Y) is a distance.
We also have for \epsilon \in (0,1) and x \ge 0 the inequality
\epsilon i_\epsilon(x) \le \min\{x,1\} \le \epsilon + i_\epsilon(x)
Replacing x by |X_n-X| and taking expectations shows that
\epsilon P(|X_n-X| \ge \epsilon) \le d(X_n,X) \le \epsilon + P(|X_n-X| \ge \epsilon)
Letting n \to \infty and then \epsilon \to 0, these inequalities show that X_n converges to X in probability if and only if d(X_n,X) \to 0, which proves the first claim.
Finally assume that \{X_n\} is a Cauchy sequence for the metric d, that is \lim_{n,m \to \infty} d(X_n,X_m) =0 or, as we have just seen, equivalently: for any \epsilon >0 we can find N such that P(|X_n-X_m| \ge \epsilon) \le \epsilon for n,m\ge N. Choose now \epsilon_k = \frac{1}{2^k} and find corresponding N_k \le N_{k+1} \le \cdots. Setting Y_k=X_{N_k}, this implies that \sum_{k} P(|Y_k-Y_{k+1}|>\epsilon_k) < \infty and thus, by part 3. of Theorem 1.2, Y_k converges almost surely to a RV Y.
To conclude we show that X_n converges to Y in probability. Since |X_n-Y| \le |X_n - X_{N_k}| + |Y_k -Y| we have
i_\epsilon(|X_n-Y|) \le i_{\frac{\epsilon}{2}}(|X_n-X_{N_k}|) + i_{\frac{\epsilon}{2}}(|Y_k-Y|) \,.
Taking expectations gives
P(|X_n-Y| > \epsilon) \le P(|X_n-X_{N_k}| > \epsilon/2) + P(|Y_k-Y|> \epsilon/2).
The first term is small for n and k large enough since the sequence X_n is Cauchy, and the second term goes to 0 as k \to \infty since Y_k converges to Y almost surely and hence in probability. Therefore X_n converges to Y in probability. \quad \square
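As a numerical sanity check of the equivalence in Theorem 1.6, one can estimate d(X_n,X) by simulation (a sketch under the hypothetical choice X_n = X + Z_n with Z_n uniform on [0,1/n], for which d(X_n,X)=\frac{1}{2n} exactly):
Code
import numpy as np

rng = np.random.default_rng(0)

def d_estimate(n, num_samples=100_000):
    # Monte Carlo estimate of d(X_n, X) = E[min(|X_n - X|, 1)] when X_n - X ~ Uniform[0, 1/n]
    z = rng.uniform(0, 1 / n, size=num_samples)
    return np.mean(np.minimum(np.abs(z), 1.0))

for n in [1, 10, 100, 1000]:
    print(f"n = {n:4d}:  d(X_n, X) ≈ {d_estimate(n):.5f}   (exact value 1/(2n) = {1 / (2 * n):.5f})")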
1.3 Convergence in L^p.
Convergence in L^p simply uses the norm \|X\|_p. Most of the time we use p=1 or p=2.
Definition 1.3 (Convergence in L^p) A sequence of RVs \{X_n\} converges in L^p to a RV X if
\lim_{n \to \infty} E[|X_n-X|^p] =0
or equivalently \lim_{n \to \infty}\|X_n-X\|_p=0.
Remarks:
The limit of a sequence in L^p is unique since \|X-Y\|_p\le \|X-X_n\|_p + \|Y-X_n\|_p by the Minkowski inequality.
Note that if X_n converges to X in L^1 then we have convergence of the first moments. Indeed we have
|E[X_n]-E[X]| \le E[|X_n-X|] \quad \text{and} \quad |E[|X_n|] - E[|X|]| \le E[|X_n-X|]
(the second follows from the reverse triangle inequality ||x|-|y||\le |x-y|). Therefore E[X_n]\to E[X] and E[|X_n|]\to E[|X|].
If X_n converges in L^1 then X_n need not converge almost surely. See the sequence in Equation 1.2, which also converges to 0 in L^1.
If X_n converges in L^p then X_n converges in probability as well. We have P(|X-X_n| \ge \epsilon) \le E[|X_n-X|^p]/{\epsilon^p} by Markov inequality. In particular, by Theorem 1.3 if X_n converges to X in L^p then there exists a subsequence X_{n_k} which converges almost surely to X.
Conversely, convergence in probability does not imply convergence in L^1. Modify the sequence in Equation 1.2 by rescaling each term so that its expectation equals 1, that is set Y_n = X_n/P(X_n=1): Y_1=2X_1, Y_2=2X_2, Y_3=3X_3, Y_4=3X_4, Y_5=3X_5, Y_6=4X_6, \cdots. This sequence still converges to 0 in probability, but E[Y_n]=1 for all n, so Y_n does not converge to 0 in L^1. Note also that for any n there are infinitely many m \ge n such that Y_n and Y_m have disjoint supports, hence E[|Y_n-Y_m|]=2, so the sequence does not converge in L^1 at all.
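The phenomenon in the last remark is easy to check numerically (a sketch; instead of the sequence Y_n above we use the analogous example Y_n = n\,1_{[0,1/n)} on \Omega=[0,1), which also converges to 0 in probability while E[Y_n]=1 for every n):
Code
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(0, 1, size=200_000)            # omega drawn from Lebesgue measure on [0, 1)

for n in [10, 100, 1000]:
    Y_n = n * (U < 1 / n)                      # Y_n(omega) = n * 1_{[0, 1/n)}(omega)
    print(f"n = {n:4d}:  P(|Y_n| > 0.1) ≈ {np.mean(np.abs(Y_n) > 0.1):.4f}   E[|Y_n|] ≈ {np.mean(np.abs(Y_n)):.3f}")
# P(|Y_n| > eps) -> 0 (convergence in probability) while E[|Y_n|] stays near 1 (no convergence in L^1)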
We prove now a partial converse which looks a bit like the dominated convergence theorem.
Theorem 1.7 (Convergence in L^p versus convergence in probability)
If X_n converges to X in L^p then X_n converges to X in probability.
If X_n converges to X in probability and |X_n|\le Y for some Y \in L^p then X_n converges to X in L^p.
Proof. We already discussed 1. (Markov inequality).
For the converse, since X_n converges in probability to X there exists a subsequence X_{n_k} which converges almost surely to X.
Since |X_{n_k}|\le Y we see that |X| \le Y almost surely and thus X \in L^p.
The sequence a_n =E[|X-X_n|^p], is bounded since, by Minkowski
a_n^{1/p}=\|X-X_n\|_p \le \|X\|_p +\|X_n\|_p \le 2\|Y\|_p \,.
Let a_{n_k} be a convergent subsequence. Since X_{n_k} converges to X in probability there exists a subsubsequence X_{n_{k_j}} which converges to X a.s. Then |X_{n_{k_j}}-X|^p converges almost surely to 0 and |X_{n_{k_j}}-X|^p \le 2^p Y^p which is integrable. So by the DCT, a_{n_{k_j}} converges to 0 and hence so does a_{n_k}. Thus every convergent subsequence of the bounded sequence a_n converges to 0, which implies that a_n converges to 0. \quad \square
1.4 Exercises
Exercise 1.1 Show (by a counterexample) that if f is not continuous, convergence of X_n in probability to X does not imply convergence of f(X_n) to f(X) in probability.
Exercise 1.2 Let X_1, X_2, … be independent Bernoulli random variables with P(X_n = 1) = p_n and P(X_n = 0) = 1−p_n.
Show that X_n converges to 0 in probability if and only if \lim_n p_n=0.
Show that X_n converges to 0 a.s. if and only if \sum_n p_n < \infty.
Hint: Use the Borel Cantelli Lemmas
Exercise 1.3
Suppose X_n converges to X in L^1. Show that \lim_{n} E[X_n]=E[X].
Suppose X_n converges to X in L^2. Show that \lim_{n} E[X_n^2]=E[X^2].
2 The law of large numbers
The law of large numbers is a foundational (and also very intuitive) concept in probability theory. Suppose we are interested in finding the probability P(A) of some event A which is the outcome of some random experiment. To do this we repeat the experiment n times, for sufficiently large n, and an approximate value for P(A) is the proportion of experiments for which the outcome belongs to A:
P(A) \approx \frac{ \text{ number of times the experiment belongs to } A }{n} \quad \text{ if $n$ is large enough}
This is the basis for the frequentist approach to probability: probabilities of events are obtained by repeating the random experiment.
2.1 Strong law of large numbers
Consider a probability space (\Omega, \mathcal{A},P) on which real-valued random variables X_1, X_2, \cdots are defined. We define then the sum
S_n = X_1 + \cdots + X_n
The law of large numbers refers to the convergence of the average \frac{S_n}{n} to a limit. The convergence can be in probability, in which case we talk of a weak law of large numbers, or almost sure, in which case we talk about a strong law of large numbers.
Example: Normal Consider X_1, X_2, \cdots, X_n independent normal RVs with mean 0 and variance \sigma^2. Then using the moment generating function we have
E\left[e^{t\frac{S_n}{n}}\right] = E\left[e^{\frac{t}{n}X_1} \cdots e^{\frac{t}{n}X_n} \right] = E\left[e^{\frac{t}{n}X_1}\right] \cdots E\left[e^{\frac{t}{n}X_n} \right] = \left(e^{\frac{\sigma^2}{2} \frac{t^2}{n^2}} \right)^n = e^{\frac{\sigma^2}{2} \frac{t^2}{n}}
We conclude that \frac{S_n}{n} is a normal random variable with variance \frac{\sigma^2}{n}.
By Chebyshev we conclude that P\left( \left|\frac{S_n}{n}\right| \ge \epsilon\right) \le \frac{\sigma^2}{n \epsilon^2} and so \frac{S_n}{n} converges to 0 in probability. This is not enough to show almost sure convergence since \sum_{n}\frac{\sigma^2}{n \epsilon^2} =\infty.
By the Chernoff bound however we have (see the example following the Chernoff bound theorem) P\left( \left|\frac{S_n}{n}\right| \ge \epsilon \right) \le e^{-n \epsilon^2/2\sigma^2} and since \sum_{n}e^{-n \epsilon^2/2\sigma^2} < \infty the Borel-Cantelli criterion (see Theorem 1.2) implies that \frac{S_n}{n} converges to 0 a.s.
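A quick simulation of this example (a sketch; \sigma=1 and the sample sizes shown are arbitrary choices): the running averages S_n/n of i.i.d. standard normals settle down to the mean 0.
Code
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
X = rng.normal(0.0, sigma, size=1_000_000)                 # i.i.d. N(0, sigma^2) samples
running_mean = np.cumsum(X) / np.arange(1, len(X) + 1)     # S_n / n for n = 1, ..., 10^6
for n in [10, 1_000, 100_000, 1_000_000]:
    print(f"n = {n:8d}:  S_n/n = {running_mean[n - 1]: .5f}")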
Example: Cauchy Consider X_1, X_2, \cdots, X_n independent Cauchy RVs with parameter \beta. Using the characteristic function (recall the characteristic function of a Cauchy RV is e^{-\beta|t|}) we find that
E\left[e^{it\frac{S_n}{n}}\right]=E\left[e^{\frac{it}{n}X_1} \cdots e^{\frac{it}{n}X_n} \right]
= E\left[e^{\frac{it}{n}X_1}\right] \cdots E\left[e^{\frac{it}{n}X_n} \right] = \left( e^{-\beta\left|\frac{t}{n}\right|} \right)^n = e^{-\beta|t|}
that is, \frac{S_n}{n} is a Cauchy RV with the same parameter as the X_i. No almost sure convergence or convergence in probability is to be expected here; convergence in distribution will be the useful notion for this example.
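By contrast, a simulation with Cauchy samples shows the running averages never settling down, consistent with S_n/n having the same Cauchy distribution for every n (a sketch with \beta=1):
Code
import numpy as np

rng = np.random.default_rng(3)
beta_par = 1.0
X = beta_par * rng.standard_cauchy(size=1_000_000)         # i.i.d. Cauchy(beta) samples
running_mean = np.cumsum(X) / np.arange(1, len(X) + 1)     # S_n / n
for n in [10, 1_000, 100_000, 1_000_000]:
    print(f"n = {n:8d}:  S_n/n = {running_mean[n - 1]: .3f}")   # keeps fluctuating: no law of large numbers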
We start with a (not optimal) version of the LLN
Theorem 2.1 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i] and variance \sigma^2=\rm{Var}(X_i)< \infty. Then we have
\lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n}
= \mu \quad
\left\{
\begin{array}{l}
\text{almost surely}\\ \text{in probability} \\ \text{ in } L^2
\end{array}
\right.
Proof. The proof is literally the same as what we did for discrete random variables and we shall not repeat it here.
A stronger version exists; only the existence of a finite mean is needed.
Theorem 2.2 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i]. Then we have
\lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n}
= \mu \quad
\left\{
\begin{array}{l}
\text{almost surely}\\ \text{in probability}
\end{array}
\right.
Various proofs of this exist. For example one can use a truncation argument for the random variables and an argument similar to the previous theorem, working with subsequences. Another, fancier, proof uses the Martingale convergence theorem.
We prove next that the case \mu=\infty can also be treated.
Theorem 2.3 Suppose X_1,X_2, \cdots are independent identically distributed non-negative random variables with E[X_i]=+\infty. Then we have
\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = +\infty \quad \text{almost surely}
Proof. We use a truncation argument combined with the monotone convergence theorem. Given R>0 set Y_n = \min\{ X_n, R\} which is bounded and thus has finite variance. So by Theorem 2.2 we have, almost surely, for \mu_R =E[ \min\{ X_1, R\}]
\lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R
Since X_n \ge Y_n we have for any R
\liminf_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} \ge \lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R
But as R \nearrow \infty, \min\{ X_1, R\} \nearrow X_1 and thus by the monotone convergence theorem \mu_R =E[ \min\{ X_1, R\}]\nearrow E[X_1]=\infty. This concludes the proof. \quad \square
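For a numerical illustration of Theorem 2.3 (a sketch, with a distribution chosen here for convenience), take i.i.d. Pareto random variables with density x^{-2} on [1,\infty), which have infinite mean; the running averages keep growing:
Code
import numpy as np

rng = np.random.default_rng(4)
U = rng.uniform(0, 1, size=1_000_000)
X = 1.0 / (1.0 - U)                                        # Pareto(1): density x^{-2} on [1, infty), E[X] = +infty
running_mean = np.cumsum(X) / np.arange(1, len(X) + 1)     # S_n / n
for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:8d}:  S_n/n = {running_mean[n - 1]:.1f}")   # keeps growing, in line with Theorem 2.3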
2.2 Sample variance
Example: convergence of the sample variance Suppose X_1, X_2, \cdots are independent and identically distributed RVs; the sample variance is given by
V_n =\sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2
After some calculation one can prove that E[V_n]= (n-1)\sigma^2.
We claim that \frac{V_n}{n} \to \sigma^2 almost surely. Indeed we have
\frac{V_n}{n}= \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} \left( X_i^2 - 2 X_i \frac{S_n}{n} + \frac{S_n^2}{n^2} \right) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{S_n}{n}\right)^2
By the law of large numbers, Theorem 2.2, applied to the RVs X_1, X_2, \cdots and to the RVs X_1^2, X_2^2, \cdots (note that E[X_1^2]=\sigma^2 + \mu^2), and by continuity, Theorem 1.4, we have
\frac{V_n}{n} \to \sigma^2 + \mu^2 - \mu^2 =\sigma^2\,.
Code
import random
import matplotlib.pyplot as plt

# Parameter for the exponential distribution
lambda_parameter = 0.5   # adjust this to your desired rate parameter

# Initialize variables to track the sample mean and sample variance
sample_mean = 0.0
sample_variance = 0.0
sample_size = 0

# Number of samples to collect
num_samples = 100000

# Lists to store data for plotting
sample_means = []
sample_variances = []

for _ in range(num_samples):
    # Generate a random sample from the exponential distribution
    sample = random.expovariate(lambda_parameter)

    # Update the sample size
    sample_size += 1

    # Update the sample mean incrementally (Welford's recursion)
    delta = sample - sample_mean            # deviation from the previous mean
    sample_mean = sample_mean + delta / sample_size

    # Update the sample variance incrementally
    if sample_size > 1:
        sample_variance = ((sample_size - 2) * sample_variance
                           + delta * (sample - sample_mean)) / (sample_size - 1)

    # Append sample mean and variance to the lists for plotting
    sample_means.append(sample_mean)
    sample_variances.append(sample_variance)

print(f"Last 20 sample means     = {sample_means[-20:]}")
print(f"Last 20 sample variances = {sample_variances[-20:]}")

# Plot the sample mean and sample variance as a function of sample size
plt.figure(figsize=(8, 12))

plt.subplot(211)
plt.plot(range(1000, num_samples + 1), sample_means[999:])
plt.title("Sample Mean vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Mean")

plt.subplot(212)
plt.plot(range(1000, num_samples + 1), sample_variances[999:])
plt.title("Sample Variance vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Variance")

plt.tight_layout()
plt.show()
2.3 LLN proof of the Weierstrass approximation theorem
A classical result in analysis is that a continuous function f:[a,b] \to \mathbb{R} can be uniformly approximated by polynomials: for any \epsilon >0 there exists a polynomial p(x) such that \sup_{x \in [a,b]}|p(x) - f(x)|\le \epsilon. Many proofs of this result exist and we give one here based on the Law of Large Numbers, although the statement has nothing to do with probability. Without loss of generality, by rescaling, we can take [a,b]=[0,1] and we use the polynomials naturally associated to binomial random variables, the Bernstein polynomials.
Theorem 2.4 Let f:[0,1]\to \mathbb{R} be a continuous function. Let f_n(x) be the Bernstein polynomial of degree n associated to f, given by
f_n(x) = \sum_{k=0}^n {n \choose k} x^k(1-x)^{n-k} f(k/n)\,.
Then we have
\lim_{n\to \infty} \sup_{x \in [0,1]} |f(x)-f_n(x)| =0 \,.
Proof. If the X_i are IID Bernoulli random variables with success probability p then S_n=X_1 + \cdots + X_n is a binomial RV and
E\left[f\left(\frac{S_n}{n}\right)\right] = \sum_{k=0}^n {n \choose k} p^k(1-p)^{n-k} f(k/n) = f_n(p)
Since \frac{S_n}{n} converges to p a.s. and in probability, f\left(\frac{S_n}{n}\right) converges to f(p) and thus, taking expectations, f_n(p) converges to f(p). We still need to work harder, though, to establish uniform convergence in p.
The variance of S_n is n p(1-p) \le \frac{n}{4}, which is bounded uniformly in p\in [0,1], and thus by Chebyshev
P\left( \left|\frac{S_n}{n} - p\right| \ge \delta \right) \le \frac{p(1-p)}{n \delta^2} \le \frac{1}{4 n \delta^2}
Since f is continuous on a compact interval, f is bounded with \sup_{x}|f(x)|=M < \infty and f is also uniformly continuous on [0,1]. Given \epsilon >0 pick \delta such that |x-y| < \delta \implies |f(x)-f(y)| < \epsilon. We have, for n large enough,
\begin{aligned}
& |f_n(p)-f(p)|=\left| E\left[f\left(\frac{S_n}{n}\right)\right] - f(p) \right| \le E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right|\right] \\
&= E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| \ge \delta} \right]
+ E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| < \delta} \right] \le 2M \frac{1}{4 n \delta^2} + \epsilon \le 2\epsilon
\end{aligned}
This proves uniform convergence. \quad \square.
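The Bernstein polynomials are easy to evaluate numerically (a sketch; the test function f(x)=|x-1/2| and the grid used to approximate the sup norm are arbitrary choices):
Code
import numpy as np
from scipy.stats import binom

def bernstein(f, n, x):
    # f_n(x) = sum_k C(n,k) x^k (1-x)^{n-k} f(k/n), written with the Binomial(n, x) pmf
    k = np.arange(n + 1)
    return np.sum(binom.pmf(k, n, x) * f(k / n))

f = lambda x: np.abs(x - 0.5)                   # a continuous (but not smooth) test function
xs = np.linspace(0, 1, 101)
for n in [10, 100, 1000]:
    err = max(abs(bernstein(f, n, x) - f(x)) for x in xs)   # sup norm approximated on a grid
    print(f"n = {n:4d}:  sup_x |f_n(x) - f(x)| ≈ {err:.4f}")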
2.4 Empirical distribution and Glivenko-Cantelli
Example: sample CDF and empirical measure Suppose X_1, X_2, \cdots are independent and identically distributed RVs with common CDF F(t). Then
F_n(t)=\frac{\# \{ i \in \{1,\cdots,n\}: X_i \le t\}}{n} \to F(t) \quad \text{ almost surely}\,.
\tag{2.1}
F_n(t)=F_n(t,\omega) is called the empirical CDF. Note that F_n(t,\omega) is the (random) CDF of the discrete distribution
\frac{1}{n} \sum_{i=1}^n\delta_{X_i(\omega)}
which is called the empirical distribution. The convergence of F_n(t) to F(t) is just the law of large numbers applied to Y_n=1_{X_n \le t}, whose mean is E[Y_n]=P(X_n \le t)=F(t).
A strengthening of the law of large numbers is that the empirical CDF F_n(t) converges to F(t) uniformly in t.
Theorem 2.5 (Glivenko-Cantelli Theorem) For a RV X with CDF F(t) we have
\sup_{t \in \mathbb{R}}|F_n(t)-F(t)| \text{ converges to } 0 \quad \text{almost surely}.
Proof. We only prove the case where F(t) is continuous, but the proof can be generalized to a general CDF by considering the jumps more carefully. The proof relies on the fact that F is increasing, which precludes oscillations and controls the convergence.
First we show that we can pick a set of probability 1 such that the convergence occurs for all t \in \mathbb{R} on that set. Since a countable union of sets of probability 0 has probability 0, we can pick a set \Omega_0 of probability 1 such that F_n(t,\omega) converges to F(t) for all rational t and all \omega \in \Omega_0. For x \in \mathbb{R} and rational s,t with s \le x \le t we have
F_n(s,\omega) \le F_n(x,\omega) \le F_n(t,\omega)
and therefore
F(s)\le \liminf_{n}F_n(x,\omega) \le \limsup_{n}F_n(x,\omega) \le F(t)
Since F(t) \searrow F(x) as t \searrow x and F(s) \nearrow F(x) as s \nearrow x we conclude that F_n(x,\omega) \to F(x) for all x \in \mathbb{R} and all \omega \in \Omega_0.
We show next that, for any \omega \in \Omega_0, the convergence is uniform in t. Since F is increasing and bounded, given \epsilon>0 we can find t_0=-\infty < t_1 < \cdots < t_m=+\infty such that F(t_j)-F(t_{j-1}) \le \frac{\epsilon}{2}. Using that F_n and F are increasing we have for t \in [t_{j-1},t_j]
\begin{aligned}
F_n(t)-F(t) &\le F_n(t_j) - F(t_{j-1}) \le F_n(t_j)- F(t_j) + \frac{\epsilon}{2}\\
F_n(t)-F(t) &\ge F_n(t_{j-1}) - F(t_{j}) \ge F_n(t_{j-1})- F(t_{j-1}) - \frac{\epsilon}{2}\\
\end{aligned}
We can now pick N_j=N_j(\omega) such that |F_n(t_j)- F(t_j)|\le \frac{\epsilon}{2} if n \ge N_j and therefore if n\ge N=\max_j N_j we have, for all t \in \mathbb{R},
|F_n(t,\omega)-F(t)| \le \epsilon
for all n \ge N(\omega). This shows that \sup_{t\in \mathbb{R}}|F_n(t)-F(t)| converges almost surely to 0. \quad \square
Illustration of the Glivenko-Cantelli Theorem (made with Chat GPT)
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Generate random data from a Beta distribution
np.random.seed(42)
true_distribution = beta.rvs(2, 5, size=1000)

# Empirical distribution function
def empirical_distribution(data, x):
    return np.sum(data <= x) / len(data)

# Compute the empirical distribution function for different sample sizes
sample_sizes = [10, 100, 1000]
x_values = np.linspace(0, 1, 1000)

plt.figure(figsize=(12, 8))

for n in sample_sizes:
    # Generate a random sample of size n
    sample = np.random.choice(true_distribution, size=n, replace=True)
    # Calculate empirical distribution function values
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    # Plot the empirical distribution function
    plt.plot(x_values, edf_values, label=f'n={n}')

# Plot the true distribution function
true_cdf_values = beta.cdf(x_values, 2, 5)
plt.plot(x_values, true_cdf_values, linestyle='--', color='black', label='True Distribution')

# Plot the difference between the ECDF and the true CDF
for n in sample_sizes:
    sample = np.random.choice(true_distribution, size=n, replace=True)
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    difference = np.abs(edf_values - true_cdf_values)
    plt.plot(x_values, difference, linestyle='dotted', label=f'n={n} (Difference)')

plt.title('Glivenko-Cantelli Theorem with Beta Distribution')
plt.xlabel('x')
plt.ylabel('Empirical Distribution Function / Difference')
plt.legend()
plt.show()
2.5 The Monte-Carlo method
The (simple) Monte-Carlo method is a probabilistic algorithm that uses sums of independent random variables and the law of large numbers to estimate a (deterministic) quantity \mu \in \mathbb{R} (or \mathbb{R}^d).
The basic idea is to express \mu as the expectation of some random variable \mu = E [ h(X) ] and then use the law of large numbers to build up an estimator for \mu.
Simple Monte-Carlo Sampling Algorithm: To compute \mu \in \mathbb{R}
Find a random variable h(X) such that \mu= E[h(X)].
I_n = \frac{1}{n} \sum_{k=1}^n h(X_k), where X_k are IID copies of X, is an unbiased estimator for \mu, that is we have
For all n we have E[I_n] = \mu (unbiased).
\lim_{n \to \infty} I_n = \mu almost surely and in probability.
An interesting aspect is that there are, in general, many ways to choose the random variable h(X).
Conversely, in many problems the random variable h(X) is given but the expectation is too difficult to compute analytically, so we rely on the LLN to compute \mu numerically.
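As a simple illustration of the algorithm (a sketch; the target \mu=\int_0^1 e^{-x^2}dx and the choice h(x)=e^{-x^2} with X uniform on [0,1] are just one possible setup):
Code
import numpy as np

rng = np.random.default_rng(5)

def monte_carlo(n):
    # I_n = (1/n) sum_k h(X_k) with X_k ~ Uniform[0,1] and h(x) = exp(-x^2)
    X = rng.uniform(0, 1, size=n)
    return np.mean(np.exp(-X**2))

for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:8d}:  I_n = {monte_carlo(n):.5f}")
# the target mu = int_0^1 exp(-x^2) dx = (sqrt(pi)/2) erf(1) ≈ 0.74682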
2.6 Computing \pi with Buffon’s needles
This seems to be the first example of rejection sampling used to solve a mathematical problem, by Le Comte de Buffon (see his biography on Wikipedia).
A needle of length l is thrown at random on a floor made of floorboards of width d, and we assume l \le d. We want to compute the probability that the needle intersects two floorboards, that is, crosses one of the lines between boards.
Denote by X the distance from the center of the needle to the nearest line between two floorboards (this is uniformly distributed on [0,\frac{d}{2}]) and by \Theta the acute angle between the needle and the lines (this is uniformly distributed on [0,\frac{\pi}{2}]).
For the needle to cross a line we must have X \le \frac{l}{2} \sin(\Theta) and thus
P\left(X \le \frac{l}{2} \sin(\Theta)\right) = \int_0^{\frac{\pi}{2}} \int_0^{\frac{l}{2}\sin(\theta)} \frac{2}{d} dx \, \frac{2}{\pi} d\theta = \frac{2l}{d\pi}
So in order to estimate \pi one throws n needles on the floor at random and uses \pi \approx \frac{2ln}{d}\, \frac{1}{ \# \textrm{ of needles intersecting two floorboards}}. No random number generator needed….
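Of course one can also cheat and simulate the needle throws with a random number generator (a sketch with l=d=1, sampling X and \Theta as above):
Code
import numpy as np

rng = np.random.default_rng(6)
l, d, n = 1.0, 1.0, 1_000_000

X = rng.uniform(0, d / 2, size=n)              # distance from the needle's center to the nearest line
Theta = rng.uniform(0, np.pi / 2, size=n)      # acute angle between the needle and the lines
crossings = np.sum(X <= (l / 2) * np.sin(Theta))

pi_estimate = 2 * l * n / (d * crossings)      # pi ≈ 2 l n / (d * number of crossings)
print(f"estimate of pi from {n} thrown needles: {pi_estimate:.4f}")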
2.7 Computing \pi with random numbers
Enclose a disk of radius 1 in a square of side length 2 and consider the following Bernoulli random variable X.
Generate two independent random variables V_1,V_2 uniformly distributed on [-1,1].
If V_1^2 + V_2^2 \le 1, set X=1, otherwise set X=0.
X is Bernoulli with p=\frac{\textrm{area of the disk}}{\textrm{ area of the square}}=\frac{\pi}{4}
So \frac{4}{n}\sum_{k=1}^n X_k \to \pi with probability 1.
Code
import numpy as np

N = 1000000
dart_positions = 2 * np.random.rand(N, 2) - 1   # generates numbers in [-1, 1]

Ncircle = [0]   # start the count with 0 to make our loop-logic easier
for x, y in dart_positions:
    if x**2 + y**2 < 1:
        Ncircle.append(Ncircle[-1] + 1)   # another dart has fallen in the circle
    else:
        Ncircle.append(Ncircle[-1])       # the dart fell outside of the circle

running_estimate = []
for n_total, n_circle in enumerate(Ncircle[1:]):   # skip the initial 0-count
    # n_total starts at 0, so we need to add 1
    running_estimate.append(4 * n_circle / (n_total + 1))

np.transpose(running_estimate[-20:])