1-Probability models

Math 605, 2023

Luc Rey-Bellet

University of Massachusetts Amherst

2023-12-13

1 Axioms of Probability

  • The axioms of probability were first formalized by Andrey Kolmogorov in 1933 but probability theory started much earlier.

  • The Doctrine of Chances by Abraham de Moivre in 1718 is usually considered the first probability textbook, and de Moivre also proved a first version of the central limit theorem in 1733.

  • The Ars Conjectandi by Jacob Bernoulli in 1713 contains the first proof of the Law of Large Numbers, and the subsequent work by Pierre Simon Laplace, Théorie analytique des probabilités (1812), presents a formalization of probability theory.

  • In Laplace’s book the concept of conditional probability (“the” key idea in probability) is also presented; it was first introduced by Thomas Bayes in 1763 in An Essay towards solving a Problem in the Doctrine of Chances.

  • The mathematical foundations of probability rest on measure theory, developed earlier by the likes of Émile Borel, Henri Lebesgue, and many others.

  • According to a remark attributed to Mark Kac: probability theory is measure theory with a soul.

  • Probability is the new kid on the mathematics block (after Geometry, Algebra and Analysis) but of course the coolest kid.

  • Probability theory provides the mathematical foundations and the language for Statistics, Ergodic Theory, Statistical Mechanics, Information Theory, and (a big part of) Machine Learning.

1.1 \sigma-algebras

The state space or sample space \Omega is some abstract space and we denote subsets of \Omega by capital letters A,B,\cdots. We use the language of set theory: A^c= \Omega\setminus A, \quad A \cap B, \quad A \cup B, \quad A \setminus B, \quad A \supset B, \quad \emptyset.

Definition 1.1 (partition) A partition of \Omega is a collection of sets, (A_i)_{i=1}^\infty, such that

  • A_i \cap A_j =\emptyset (pairwise disjoint) for all i \not=j.

  • \bigcup_{i=1}^\infty A_i = \Omega

Intuition:

\Omega = collection of all possible outcomes of an experiment

A = event = collection of all outcomes compatible with the event A

Write 2^\Omega for the collection of all subsets of \Omega. We denote by calligraphic letters \mathcal{A}, \mathcal{E}, \mathcal{F}, ... collections of subsets of \Omega (that is, subsets of 2^\Omega).

We introduce some natural collections of subsets.

Definition 1.2 (algebra) A collection of sets \mathcal{E} is an algebra if

  • \emptyset \in \mathcal{E} and \Omega \in \mathcal{E}.

  • \mathcal{E} is closed under complement: A \in \mathcal{E} \implies A^c \in \mathcal{E}.

  • \mathcal{E} is closed under finite union and intersection: A_1, \cdots, A_n \in \mathcal{E} \implies \begin{array}{c}\bigcup_{i=1}^n A_i \in \mathcal{E} \\ \bigcap_{i=1}^n A_i \in \mathcal{E}\end{array}.


Definition 1.3 (\sigma-algebra) A collection of sets \mathcal{A} is a \sigma-algebra if

  • \emptyset \in \mathcal{A} and \Omega \in \mathcal{A}.

  • \mathcal{A} is closed under complement: A \in \mathcal{A} \implies A^c \in \mathcal{A}.

  • \mathcal{A} is closed under countable union and intersection: A_1, A_2, \cdots, \in \mathcal{A} \implies \begin{array}{c}\bigcup_{i=1}^\infty A_i \in \mathcal{A} \\ \bigcap_{i=1}^\infty A_i \in \mathcal{A}\end{array}.

Examples of \sigma-algebras: the trivial \sigma-algebra \mathcal{A} = \left\{ \emptyset, \Omega \right \}, the collection of all subsets \mathcal{A}=2^\Omega, the \sigma-algebra generated by a single set A,
\mathcal{A} = \left\{ \emptyset, A, A^c, \Omega \right \},
and the \sigma-algebra generated by two sets A and B, which contains (among other sets) \emptyset, A, B, A^c, B^c, A\cup B, A \cap B, A\setminus B, B \setminus A, A^c \cap B^c, \Omega.

Definition 1.4 (\sigma-algebra generated by \mathcal{C}) If \mathcal{C} is a collection of subsets of \Omega, the \sigma-algebra generated by \mathcal{C}, denoted by \sigma(\mathcal{C}), is the smallest \sigma-algebra containing \mathcal{C}.


Remark: \sigma(\mathcal{C}) always exists since, alternatively, you can think of it as the intersection of all \sigma-algebras which contain \mathcal{C}. Indeed we always have \mathcal{C} \subset 2^\Omega and arbitrary intersections of \sigma-algebras are \sigma-algebras (see Homework for more details).


If the space \Omega has a topology (e.g. \Omega = \mathbb{R}) then it is natural to pick a \sigma-algebra compatible with the open sets.


Definition 1.5 (The Borel \sigma-algebra) If \Omega=\mathbb{R} the Borel \sigma-algebra \mathcal{B} is the \sigma-algebra generated by the collection of all open sets (or, equivalently, by the closed sets).
The sets in the Borel \sigma-algebra \mathcal{B} are called Borel sets.


Remark: The Borel \sigma-algebra of \mathbb{R} is strictly smaller than 2^\mathbb{R} (see Math 623 for a proof (existence of non-measurable sets)). The \sigma-algebra 2^\mathbb{R} is too big to be useful!

An important characterization of the Borel \sigma-algebra of \mathbb{R} is the following. It will play a big role in characterizing random variables.

Theorem 1.1 The Borel \sigma-algebra of \mathbb{R} is generated by the countable collection of intervals of the form (-\infty, a] \quad \text{where} \quad a \in \mathbb{Q}

Proof. Let \mathcal{C} be the collection of all open set.

  • If A is an open set, then A can be written as a countable union of disjoint open intervals A = \cup_{i}(a_i, b_i).

  • An interval (a,b) can be written as (a,b) = (-\infty,b) \setminus (-\infty,a].

  • Taking a sequence b_n \in \mathbb{Q} with b_n \nearrow b we have (-\infty,b)= \bigcup_{n}(-\infty, b_n].

  • Taking a sequence a_n \in \mathbb{Q} with a_n \searrow a we have (-\infty,a]=\bigcap_{n}(-\infty,a_n].

If \mathcal{D} is the collection of intervals (-\infty,a] with a \in \mathbb{Q}, then we have shown that \mathcal{C} \subset \sigma(\mathcal{D}) and therefore \sigma(\mathcal{C}) \subset \sigma(\mathcal{D}). But since \mathcal{D} consists of closed sets (which are Borel sets) we have \mathcal{D} \subset \sigma(\mathcal{C}) and thus \sigma(\mathcal{D}) \subset \sigma(\mathcal{C}). \quad \square

1.2 Axioms of Probability

Two simple axioms underlie probability theory.

Definition 1.6 (Probability measure) A probability measure defined on a \sigma-algebra \mathcal{A} is a function P : \mathcal{A} \to [0,1] with the following properties:

  • P(\emptyset)=0 and P(\Omega)=1

  • (Countable additivity) For any countable collection of pairwise disjoint sets A_i \in \mathcal{A} (i.e. A_i \cap A_j = \emptyset for all i\not=j ) we have P\left( \bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)

A probability space (\Omega, \mathcal{A}, P) consists of the set \Omega, a \sigma-algebra \mathcal{A}, and a probability measure P.


Remark: As we shall see, requiring finite additivity P\left( \bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) for finite n would not be sufficient.

1.3 Consequence of the axioms

Theorem 1.2 (Properties of probability measures)  

  • Finite additivity: A \cap B=\emptyset \implies P(A\cup B) = P(A) + P(B)

  • Monotonicity: A \subset B \implies P(A) \le P(B)

  • Complement: P(A^c)=1-P(A)

  • Union rule: P(A \cup B)= P(A) +P(B) - P(A \cap B)

  • Inclusion-exclusion: P\left(\bigcup_{i=1}^N A_i\right)= \sum_{i} P(A_i) - \sum_{i< j} P(A_i \cap A_j) + \sum_{i< j < k} P(A_i \cap A_j \cap A_k) + \cdots +(-1)^{N+1}P(A_1 \cap \cdots \cap A_N)
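The inclusion-exclusion formula can be checked by brute force on a small finite sample space. The following sketch (not part of the notes; the three events are arbitrary choices for illustration) compares a direct computation of P(A_1 \cup A_2 \cup A_3) with the alternating sum, for the uniform probability on a 20-point space.

```python
from itertools import combinations

# Uniform probability on Omega = {0, ..., 19}: P(A) = |A| / |Omega|
Omega = set(range(20))
P = lambda A: len(A) / len(Omega)

# Three overlapping events (arbitrary illustrative choices)
A1 = {w for w in Omega if w % 2 == 0}  # even outcomes
A2 = {w for w in Omega if w % 3 == 0}  # multiples of 3
A3 = {w for w in Omega if w < 7}       # small outcomes
events = [A1, A2, A3]

# Direct computation of P(A1 ∪ A2 ∪ A3)
direct = P(A1 | A2 | A3)

# Inclusion-exclusion: alternating sum over all non-empty sub-collections
incl_excl = sum(
    (-1) ** (r + 1) * P(set.intersection(*sub))
    for r in range(1, len(events) + 1)
    for sub in combinations(events, r)
)

assert abs(direct - incl_excl) < 1e-12  # both equal 0.75 here
```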

1.4 Countable additivity and limits

Monotone limits for sets

A_n \searrow A \quad \text{ means } \quad A_1 \supset A_2 \supset A_3 \supset \cdots \,\,\text{ and } \,\, A= \bigcap_{n=1}^\infty A_n

A_n \nearrow A \quad \text{ means } \quad A_1 \subset A_2 \subset A_3 \subset \cdots \,\,\text{ and } \,\, A= \bigcup_{n=1}^\infty A_n

Theorem 1.3 (Sequential continuity)  

  • Countable additivity implies sequential continuity that is, if A_1, A_2, \cdots \in \mathcal{A} \begin{aligned} A_n \searrow A &\implies P(A_n)\searrow P(A) \\ A_n \nearrow A &\implies P(A_n)\nearrow P(A) \end{aligned}

  • Finite additivity + sequential continuity implies countable additivity

Proof. Let us assume first countable additivity. If A_1 \subset A_2 \subset A_3 \subset \cdots define the disjoint sets B_1=A_1, B_2=A_2\setminus A_1, B_3=A_3 \setminus A_2, \cdots. Then for any N, B_1 \cup \cdots \cup B_N=A_N and \bigcup_{n=1}^\infty A_n = \bigcup_{n=1}^\infty B_n. By countable additivity P(A)=P\left(\bigcup_{n=1}^\infty A_n \right)=P\left(\bigcup_{n=1}^\infty B_n \right)=\sum_{n=1}^\infty P(B_n) = \lim_{N \to \infty}\sum_{n=1}^N P(B_n)=\lim_{N \to \infty}P(A_N).

If A_1 \supset A_2 \supset A_3 \supset \cdots then taking complement we have A_1^c \subset A_2^c \subset A_3^c \subset \cdots and using the above and de Morgan’s law P(A) = P\left(\bigcap_{n=1}^\infty A_n \right) = 1 - P\left(\bigcup_{n=1}^\infty A_n^c \right) = 1 - \lim_{N\to\infty} P\left(\bigcup_{n=1}^N A_n^c\right)=1 - \lim_{N\to\infty} P\left(A_N^c\right) = \lim_{N\to\infty} P(A_N)

For the converse statement suppose the A_i are pairwise disjoint and set B_N = \bigcup_{n=1}^N A_n. Then B_N \nearrow B=\bigcup_{n=1}^\infty A_n. Using finite additivity and continuity we find P\left(\bigcup_{n=1}^\infty A_n\right)=P(B)= \lim_{N \to \infty} P(B_N) = \lim_{N \to \infty} \sum_{n=1}^N P(A_n) = \sum_{n=1}^\infty P(A_n) which proves countable additivity.

\quad\square

1.5 More on limits of sets

limsup and liminf of sets: For an arbitrary collection of sets A_n we define the limsup and liminf by

\begin{aligned} \limsup_{n\to \infty} A_n &= \bigcap_{n=1}^\infty \bigcup_{m \ge n} A_m = \left\{ \omega \in \Omega \,;\, \omega \in A_n \text{ for infinitely many } n \right\} \\ \liminf_{n\to \infty} A_n &= \bigcup_{n=1}^\infty \bigcap_{m \ge n} A_m = \left\{ \omega \in \Omega \,;\, \omega \in A_n \text{ for all but finitely many } n \right\} \end{aligned} The fact that the two expressions on each line coincide requires some thought (see homework for more details). For example if \omega \in \bigcap_{n=1}^\infty \bigcup_{m \ge n} A_m then \omega \in \bigcup_{m \ge n} A_m for every n. Taking n=1 we find that there exists some k_1\ge 1 such that \omega \in A_{k_1}. Taking next n=k_1+1 we see that there exists k_2 >k_1 such that \omega \in A_{k_2}, and so on. Therefore \omega belongs to infinitely many A_n.


Characteristic function of a set A: 1_A(\omega)= \left\{ \begin{array}{cl} 1 & \omega \in A \\ 0 & \omega \notin A \end{array} \right.


Limits of sets: We say that A_n converge to A if \lim_{n} 1_{A_n}(\omega) = 1_A(\omega) for all \omega.

Theorem 1.4 If A_n converges to A then \lim_{n\to \infty} P(A_n)=P(A).

Proof. If \lim_{n} 1_{A_n}(\omega) = 1_A(\omega) for some \omega then the sequence 1_{A_n}(\omega) is eventually constant: it equals 1 (if \omega \in A) or 0 (if \omega \notin A) for all but finitely many n. Therefore if A_n converges to A this means

A= \limsup_{n}A_n = \liminf_{n} A_n.

Set B_n = \bigcap_{m\ge n} A_m and C_n =\bigcup_{m\ge n} A_m. Then we have B_n \subset A_n \subset C_n and thus, by monotonicity P(B_n) \le P(A_n) \le P(C_n).

Since the sequence B_n is increasing to \liminf_{n} A_n, the sequence C_n is decreasing to \limsup_{n} A_n, and A= \limsup_{n}A_n = \liminf_{n} A_n, we have \lim_{n\to \infty} P(A_n)=P(A). \quad\square

1.6 Conditional Probability and Independence (see Stat 515)

The intuition behind conditional probabilities is as follows. Suppose you observe that the event B has occurred (e.g. “it rained yesterday”). How does that influence the probability of another event A (e.g. “there was wind yesterday” or “it is raining today”)?

For \omega \in A to be a possible outcome we must have \omega \in A \cap B. So only events of the form A \cap B are relevant. Normalizing the probabilities leads to the following

Definition 1.7 Given B \in \mathcal{A} with P(B)>0, the conditional probability of A given B is defined by P(A | B) = \frac{P(A \cap B)}{P(B)}

If the occurrence of B does not influence the probability that A occurs, that is if P(A|B)=P(A), then we will call these events independent. This means (when P(A)>0 and P(B)>0) P(A|B) = P(A) \iff P(A \cap B) = P(A) P(B) \iff P(B|A) = P(B).
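To make this concrete, here is a minimal check with two fair dice (an illustration, not from the notes): the events "the sum is 7" and "the first die shows 3" turn out to be independent.

```python
from fractions import Fraction

# Sample space: ordered pairs of fair dice rolls, uniform probability
Omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda E: Fraction(len(E), len(Omega))

A = [w for w in Omega if w[0] + w[1] == 7]  # event: the sum is 7
B = [w for w in Omega if w[0] == 3]         # event: the first die shows 3
AB = [w for w in Omega if w in A and w in B]

# Conditional probability P(A|B) = P(A ∩ B) / P(B)
P_A_given_B = P(AB) / P(B)

# Observing B does not change the probability of A: P(A|B) = P(A) = 1/6
assert P_A_given_B == P(A) == Fraction(1, 6)
assert P(AB) == P(A) * P(B)  # equivalent product form of independence
```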

Definition 1.8 (Independent events)  

  • Two events A and B are independent if P(A \cap B) = P(A) P(B)

  • A collection of events (A_i)_{i \in I} (I possibly infinite) is independent if for any finite J \subset I, P(\bigcap_{i \in J}A_i) = \prod_{i \in J}P(A_i)

Theorem 1.5 (Properties of conditional probability)  

  • Independence: If A and B are independent so are A and B^c, A^c and B, and A^c and B^c.

  • Probability property: If P(B)>0 A \mapsto P(A|B) defines a probability on \mathcal{A}.

  • Product rule: For any events A_1, A_2, \cdots, A_n P(A_1 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1) \,P(A_3|A_1\cap A_2) \cdots P(A_n|A_1\cap \cdots \cap A_{n-1})

  • Conditioning: If E_n is a finite or countable partition of \Omega then P(A) = \sum_{n} P(A|E_n) P(E_n).

  • Bayes rule: If E_n is a finite or countable partition of \Omega then P(E_m|A) = \frac{P(A|E_m) P(E_m)}{\sum_{n} P(A|E_n) P(E_n)}.

Proof of Theorem 1.5.

  • Independence: P(A \cap B^c)= P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1-P(B)) =P(A)P(B^c). The other two claims are proved in the same way.
  • Probability property: P(\emptyset|B) = \frac{P(\emptyset)}{P(B)}=0 and P(\Omega|B) = \frac{P(\Omega \cap B)}{P(B)}=1. For countable additivity P (\cup_n A_n |B ) = \frac{P \left(\cup_{n} (A_n \cap B) \right) }{P(B)} =\sum_{n} \frac{ P(A_n \cap B)}{P(B)} = \sum_{n}P(A_n|B).
  • Product rule: Use repeatedly P(A \cap B) =P(B|A) P(A), which is just the definition. Then P(A_1 \cap \cdots \cap A_n)= P(A_n|A_1 \cap \cdots \cap A_{n-1}) P(A_1 \cap \cdots \cap A_{n-1}) = \cdots
  • Conditioning: Use countable additivity P(A)=P(A \cap \cup_{n} E_n) = P( \cup_n (A \cap E_n)) = \sum_{n} P(A \cap E_n) = \sum_{n} P(A|E_n) P(E_n).
  • Bayes: Use the definition and conditioning: P(E_m|A) = \frac{P(A\cap E_m)}{P(A)} = \frac{P(A|E_m) P(E_m)}{P(A)} = \frac{P(A|E_m) P(E_m)}{\sum_{n}P(A|E_n) P(E_n)}.
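The conditioning and Bayes formulas in Theorem 1.5 can be illustrated with a standard two-coin example (a sketch with made-up numbers: one fair coin and one coin with heads probability 3/4, picked with equal probability).

```python
from fractions import Fraction

# Partition E_1, E_2 of Omega: which coin was picked (each with prob 1/2)
P_E = [Fraction(1, 2), Fraction(1, 2)]
# P(A|E_n) where A = "the coin lands heads": fair coin vs biased coin
P_A_given_E = [Fraction(1, 2), Fraction(3, 4)]

# Conditioning: P(A) = sum_n P(A|E_n) P(E_n)
P_A = sum(pa * pe for pa, pe in zip(P_A_given_E, P_E))
assert P_A == Fraction(5, 8)

# Bayes rule: P(E_m|A) = P(A|E_m) P(E_m) / sum_n P(A|E_n) P(E_n)
posterior = [pa * pe / P_A for pa, pe in zip(P_A_given_E, P_E)]
assert posterior == [Fraction(2, 5), Fraction(3, 5)]  # heads favors the biased coin
```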

1.7 Take-home messages

  • \sigma-algebras would not be necessary if we worked only with discrete sample spaces.

  • But to describe sample spaces like \mathbb{R} or a countable collection of discrete models (think of infinitely many coin flips) they are necessary (recall that 2^\mathbb{N} has the same cardinality as \mathbb{R}). In the dark corners of the real numbers various monsters are lurking that need to be tamed (and then ignored).

  • At a deeper and more interesting level, \sigma-algebras will occur later in conditional expectations, martingales, and stochastic processes; they will be used as an information-theoretic tool and will describe the sets of questions or inquiries you can perform on a model.

1.8 Homework problems

Exercise 1.1  

  1. Suppose \{\mathcal{A}_j\}_{j \in J} is an arbitrary collection of \sigma-algebras (J does not need to be countable). Show that the intersection \cap_{j\in J} \mathcal{A}_j is a \sigma-algebra.
  2. Is the union of \sigma-algebras a \sigma-algebra? Prove or disprove.

Exercise 1.2 Suppose \mathcal{A} is a \sigma-algebra and let B \in \mathcal{A} be some set. Define \mathcal{B} = \{ A \cap B\,;\, A \in \mathcal{A} \}.

  1. Show that \mathcal{B} is a \sigma-algebra.

  2. Show that the conditional probability P(A|B) is a probability on the \sigma-algebra \mathcal{B}.

Exercise 1.3  

  • Prove the properties of probability measures given in Theorem 1.2

  • Prove also the so-called Bonferroni inequality \sum_{i=1}^n P(A_i) - \sum_{i < j} P(A_i \cap A_j) \le P(\cup_{i=1}^n A_i) \le \sum_{i=1}^n P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i< j <k} P(A_i \cap A_j \cap A_k)

Exercise 1.4 (limsup and liminf)  

  • Prove that the two definitions for limsup and liminf of sets given in Section 1.5 are equivalent.

  • Show that \limsup_n A_n^c = (\liminf_n A_n )^c

  • I found online this pretty instructive illustration of limsup and liminf (slightly edited).

    A tech company is closing down and firing the set \Omega of all its employees, who become beggars and have to live on the street (in this story people never die). The local church decides to start giving out free food for them every day. On the n^{th} day A_n is the subset of fired employees who show up at church to get fed. There are three categories of people:

    1. Some of the people eventually get a new job and never show up at the church again.

    2. Others are too proud and try to be seen around all the time, but they need to eat so they always come back eventually.

    3. Lastly there are the people who, after trying everything else, eventually give up and start to get their food from the church each day.

    Express the categories of people in 1., 2., 3. in terms of limsup and liminf. What needs to happen for \lim_n A_n to exist?

Exercise 1.5 Consider the following lottery game: there are N lottery tickets, visibly numbered 1 to N, and the numbers 1 to N are randomly assigned, hidden under the scratch pad. (Each number comes up exactly once; in other words one is picking a random permutation of the N numbers.) If the randomly assigned number matches the visible number then the ticket wins. This lottery has the remarkable property that the probability that nobody wins is essentially independent of N.

Consider the events A_i=\{ \textrm{ticket } i \text{ is a winner}\}.

  • Compute P(A_i) and P(A_i \cap A_j) for i\not= j using the product rule.

  • Use the inclusion-exclusion formula to compute the probability that there is at least one winner.

  • Show that if N is even moderately big (maybe N\ge 5 or 6 will do) the probability that nobody wins is for all practical purposes independent of N and nearly equal to \frac{1}{e}.

Exercise 1.6 (Polya’s urn) An urn contains b blue balls and r red balls. A ball is drawn from the urn, its color is noted, and it is returned to the urn together with d balls of the same color. This process is repeated indefinitely. Denote by B_n the event that the n^{th} drawn ball is blue.

  • Compute the probability that the second ball is blue, i.e. P(B_2)

  • Compute the probability that the first drawn ball is blue given that the second drawn ball is blue.

  • Show that the probability P(B_n) that the n^{th} drawn ball is blue is independent of n.

  • Compute the probability that the first drawn ball is blue given that the n subsequent drawn balls are blue.

2 Probability and Random Variables on countable state space

2.1 Probabilities on countable state spaces

  • If \Omega=(\omega_1, \omega_2, \dots) is a countable sample space, take as \sigma-algebra \mathcal{A}=2^{\Omega}, the collection of all subsets of \Omega.

  • Specifying a probability on \mathcal{A} is equivalent to choosing a collection of numbers p_n=P(\{\omega_n\}), the probability of the event \{\omega_n\}, with p_n \ge 0 \quad \text{ and } \quad \sum_{n} p_n =1 and then for any A \subset \Omega we have P(A) = \sum_{\omega \in A} P(\{\omega\})

  • Countable additivity reduces to the fact that for absolutely convergent series we can freely interchange the order of summation: P( \cup_{i} A_i) = \sum_{\omega \in \cup_{i} A_i} p(\omega) =\sum_{i} \sum_{\omega \in A_i} p(\omega) = \sum_i P(A_i) if the A_i are pairwise disjoint.
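As a small illustration of this recipe (a sketch with illustrative weights, not from the notes), take the weights p_n = (1/2)^n on \Omega=\{1,2,3,\cdots\} and compute P(A) for the event A = "the outcome is even":

```python
# Weights p_n = (1/2)^n on Omega = {1, 2, 3, ...}; they sum to 1.
# We truncate the series at N = 60: the neglected tail is 2^(-60).
N = 60
p = {n: 0.5 ** n for n in range(1, N + 1)}
assert abs(sum(p.values()) - 1) < 1e-12

# P(A) = sum of the weights p_n over the event A = "n is even"
P_even = sum(p[n] for n in p if n % 2 == 0)
assert abs(P_even - 1 / 3) < 1e-12  # exact value: (1/4)/(1 - 1/4) = 1/3
```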

We recall some standard probability models

Definition 2.1 (Poisson) A Poisson distribution with parameter \lambda>0 is a probability distribution on \{0,1,2,3, \cdots\} with p_n = e^{-\lambda} \frac{\lambda^n}{n!}.

The Poisson distribution is ubiquitous, in particular because of its relation with the binomial distribution (sometimes called the Law of Small Numbers, see Theorem 2.1). Typical examples are the number of typos on a page, the number of earthquakes hitting a region, the number of radioactive atoms decaying, etc.

Definition 2.2 (Pareto or Zeta distribution) A Pareto distribution with decay rate \alpha>0 is a probability distribution on \{1,2,3, \cdots \} with p_n = \frac{1}{\zeta(\alpha+1)} \frac{1}{n^{\alpha+1}} \quad \text{ where } \quad \zeta(s)=\sum_{n=1}^\infty \frac{1}{n^s} \text{ is the Riemann zeta function}

The Pareto distribution is a model with polynomial tails. It is widely used in economics where polynomial tails often occur. Classical examples are the wealth distribution in a population, the size of cities, or the number of casualties in wars. See the Pareto principle (“20% of the population controls 80% of the wealth”) and the examples in this paper.

Definition 2.3 (Independent Bernoulli trials) The model is characterized by an integer n (= number of trials) and by a number 0\le p \le 1 (= the probability of success in each trial). We assume that the trials succeed or fail independently of each other. The state space is \Omega=\{0,1\}^n and we write \omega=(\omega_1,\cdots, \omega_n) with \omega_{i} = \left\{ \begin{array}{cl} 0 & \text{ if } i^{th} \text{ trial fails} \\ 1 & \text{ if } i^{th} \text{ trial succeeds} \end{array} \right. If we set \|\omega\|_1=\sum_{i=1}^n \omega_i, which counts the number of 1’s in \omega, then P(\{\omega\}) = p^{\|\omega\|_1} (1-p)^{n- \|\omega\|_1}


Definition 2.4 (Binomial random variable) The binomial distribution describes the number of successes in n Bernoulli trials. It is a probability distribution on \{0,1,\cdots,n\} with p_k = {n \choose k} p^k (1-p)^{n-k} where {n \choose k}=\frac{n!}{k!(n-k)!} is the number of ways k successes can occur among n trials.

Definition 2.5 (Geometric distribution) The geometric distribution describes when the first successful trial occurs in a series of Bernoulli trials. It is a probability distribution on \{1,2, 3,\cdots\} with p_n = (1-p)^{n-1}p since we have n-1 unsuccessful trials before the first successful one.


Definition 2.6 (Negative binomial or Pascal’s distribution) The negative binomial distribution describes when the k^{th} successful trial occurs in a series of Bernoulli trials. It is a probability distribution on \{k, k+1,k+2 \cdots\} with p_n = { n-1 \choose k-1} (1-p)^{n-k} p^k

Sometimes the negative binomial is defined slightly differently and counts the number of failures until the k^{th} success occurs.

2.2 Poisson approximation

If the number of trials n \gg 1 is large and the probability of success p \ll 1 is small, then the binomial distribution can be approximated by a Poisson distribution with parameter \lambda = np. A formal mathematical result is the following (proof in the homework).

Theorem 2.1 (The Law of small numbers) Suppose the probability of success, p_n, varies with n in such a way that \lim_{n\to \infty} np_n=\lambda (take p_n=\frac{\lambda}{n}) then we have \lim_{n \to \infty} { n \choose k } p_n^k (1-p_n)^{n-k} = e^{-\lambda} \frac{\lambda^k}{k!}

Example: Birthday problem

  • There are N students in class. Event A=“at least two students share the same birthday”.
  • p= probability that a given pair of students share the same birthday: p=\frac{1}{365}.
  • The number of trials here is the number of pairs of students, n = {N \choose 2}= \frac{N(N-1)}{2}. Note: the trials are only weakly dependent here.
  • Poisson approximation: \lambda=np={N \choose 2}\frac{1}{365} and so P(A^c) \approx e^{-{N \choose 2} \frac{1}{365}}.
  • Exact value: P(\text{no pair shares the same birthday}) = \frac{365}{365} \frac{364}{365} \cdots \frac{ 365 -N+1}{365}
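The comparison between the exact value and the Poisson approximation is easy to carry out numerically (a short sketch; N = 23 is the classical value where the probability of a shared birthday first exceeds 1/2):

```python
import numpy as np

N = 23  # number of students

# Exact: P(no shared birthday) = (365/365)(364/365)...((365-N+1)/365)
exact = np.prod([(365 - k) / 365 for k in range(N)])

# Poisson approximation with lambda = C(N,2)/365
lam = N * (N - 1) / 2 / 365
approx = np.exp(-lam)

print(f"exact P(no match) = {exact:.4f}")   # ~ 0.4927
print(f"Poisson approx    = {approx:.4f}")  # ~ 0.5000
```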

See more in the nice book Understanding Probability by Henk Tijms

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
from scipy.stats import binom

n = 40   # Number of trials
p = 0.1  # Probability of success
lam = n*p  # parameter of the Poisson distribution

# Values to evaluate the PMF at
x = np.arange(0, 15)

# Calculate the PMFs of the Poisson and binomial distributions
pmfP = poisson.pmf(x, mu=lam)
pmfB = binom.pmf(x, n, p)

# Plot the PMFs
plt.figure(figsize=(5, 3))
plt.bar(x, pmfP, color='blue', width=0.5, label=rf"Poisson $\lambda$={lam}")
plt.bar(x, pmfB, color='red', alpha=0.5, label=f"Binomial n={n}, p={p}")
plt.title('Poisson Approximation')
plt.xlabel('k')
plt.ylabel('PMF')
plt.legend()
plt.grid(True)
plt.show()

2.3 Random variable on countable state spaces

You can think of a random variable as extracting a quantity that can vary (hence the name “variable”) by observing outcomes \omega \in \Omega. The probability defined on \Omega implies that this variable is random, hence the name.

Definition 2.7 (Random Variables on discrete state space) Suppose \Omega is countable. A map X: \Omega \to T with target T (e.g. a subset of \mathbb{R} or \mathbb{N} or … ) is called a random variable (RV).
Since \Omega is countable so is its image T'=\{X(\omega)\,;\, \omega \in \Omega\} \subset T, so we may assume T is countable as well.
The distribution of X, also called the law of X, is a probability measure on T, denoted P^X, induced from the probability P on \Omega and defined as follows: P^X(A) = P(\{ \omega\,;\, X(\omega) \in A\}) = P(X^{-1}(A)) =P( X \in A). In particular for j \in T' we have p^X_j = P(X =j) = \sum_{\omega\,;\, X(\omega)=j} P(\{\omega\})
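A concrete instance of the induced law (an illustration, not from the notes): \Omega is the set of ordered pairs of dice rolls with uniform weights and X is the sum of the two dice.

```python
from collections import defaultdict
from fractions import Fraction

# Omega: ordered pairs of fair dice rolls, each with weight 1/36
Omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p = {w: Fraction(1, 36) for w in Omega}

X = lambda w: w[0] + w[1]  # random variable: the sum of the two dice

# Law of X: p^X_j = sum of P({omega}) over {omega : X(omega) = j}
law = defaultdict(Fraction)
for w in Omega:
    law[X(w)] += p[w]

assert law[7] == Fraction(6, 36)  # six outcomes have sum 7
assert sum(law.values()) == 1     # p^X is a probability on the image of X
```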

2.4 Sampling with or without replacement

  • An urn with N balls, b blue balls and r=N-b red balls. We select n balls out of the N balls, either with or without replacing balls after choosing them.

  • Sample space \Omega=\{0,1\}^n with \omega =(\omega_1, \cdots, \omega_n) where \omega_i=0 means getting a red ball and \omega_i=1 means getting a blue ball on the i^{th} selection.

  • Random variable X(\omega)= \|\omega\|_1 = \sum_i \omega_i, which is the number of blue balls drawn and takes values in \{0,1,\cdots,n\}.

Definition 2.8 (Sampling distributions: binomial and hypergeometric)  

  • If we sample with replacement, the probability to get a blue ball is always p=\frac{b}{b+r} and we get a binomial distribution p^X_k = P(X=k) = {n \choose k}p^k (1-p)^{n-k}.

  • If we sample without replacement, then we get a hypergeometric distribution p^X_k = P(X=k) = \frac{{b \choose k}{N-b \choose n-k}}{{N \choose n}}
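Both sampling distributions are available in scipy.stats; the sketch below (with arbitrary urn parameters) checks that each PMF sums to 1 over \{0,\cdots,n\}. Note that scipy's hypergeom takes its arguments in the order (population size, number of blue balls, number of draws).

```python
import numpy as np
from scipy.stats import binom, hypergeom

b, r = 10, 15  # blue and red balls (arbitrary illustrative values)
N = b + r      # total number of balls in the urn
n = 5          # number of balls drawn
p = b / N      # probability of blue on each draw with replacement

k = np.arange(0, n + 1)
with_repl = binom.pmf(k, n, p)            # sampling with replacement
without_repl = hypergeom.pmf(k, N, b, n)  # sampling without replacement

assert abs(with_repl.sum() - 1) < 1e-12
assert abs(without_repl.sum() - 1) < 1e-12
```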

2.5 Expectation

Definition 2.9 If \Omega=\{\omega_1, \omega_2, \cdots\} is countable and P is a probability measure on \Omega, we define the expectation of a random variable X: \Omega \to \mathbb{R} by E[X] = \sum_{i} X(\omega_i) p_i \,. The expectation is well defined if X \ge 0 (in which case the sum could possibly be infinite) or if the series is absolutely convergent: \sum_{i}|X(\omega_i)| p_i < \infty.

  • We denote by L^1 = \{ X: \Omega \to \mathbb{R}\,;\, \sum_{i}|X(\omega_i)| p_i < \infty\} the set of random variables with a finite expectation E[X].

  • L^1 is a vector space and we have E[aX+Y] = a E[X] + E[Y] (linearity of the expectation).

Examples:

  1. Poisson RV: X: \Omega \to \mathbb{N} and P(X =j)=e^{-\lambda}\frac{\lambda^j}{j!}. Then E[X]=\sum_{j=0}^\infty j e^{-\lambda}\frac{\lambda^j}{j!} = \sum_{j=1}^\infty e^{-\lambda}\frac{\lambda^j}{(j-1)!}=\lambda \sum_{j=0}^\infty e^{-\lambda}\frac{\lambda^j}{j!} =\lambda

  2. Bernoulli (Indicator) Random variable: Given an event A define X_A: \Omega \to \{0,1\} by X_A(\omega) = \left\{ \begin{array}{cl} 1 & \omega \in A \\ 0 & \omega \notin A\\ \end{array}\right. Then E[X_A] = 1 \cdot P(A) + 0 \cdot (1-P(A)) =P(A)

  3. Binomial Random variable: The binomial RV can be written as a sum of n Bernoulli random variables. From this and linearity we see immediately that E[X]=np.

  4. Zeta RV: If X has a zeta distribution with parameter \alpha>0 then E[X] = \sum_{n=1}^\infty n \frac{1}{\zeta(\alpha+1)} \frac{1}{n^{\alpha+1}} = \frac{1}{\zeta(\alpha+1)} \sum_{n=1}^\infty \frac{1}{n^{\alpha}} = \left\{ \begin{array}{cl} \frac{\zeta(\alpha)}{\zeta(\alpha+1)} & \alpha >1 \\ +\infty & 0 < \alpha \le 1 \end{array} \right.

  5. Geometric RV: We have, with q=1-p, E[X]=\sum_{n=1}^\infty nq^{n-1} p = p \frac{d}{dq} \sum_{n=0}^\infty q^n = p \frac{d}{dq} \frac{1}{1-q} = p \frac{1}{(1-q)^2} = \frac{1}{p}

  6. Negative binomial RV: Since a negative binomial with parameter k is the sum of k geometric RVs we have E[X]=\frac{k}{p}.

  7. Uniform distribution on \Omega=\{1,2,\cdots, N\}: we have p_j=\frac{1}{N} for every j and thus E[X] = \frac{1}{N}(1 + 2 + \cdots +N) = \frac{1}{N} \frac{N(N+1)}{2}=\frac{N+1}{2}

  8. A RV without expectation: Suppose p_n = c \frac{1}{n^2} for n \in \mathbb{Z}\setminus \{0\} and p_0=0, where c is a suitable normalization constant.
    Then the series for the expectation
    \sum_{n=-\infty}^\infty n p_n is undefined because \sum_{n> 0}np_n =\infty and \sum_{n< 0}np_n = -\infty.
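The Poisson and geometric expectations computed above can be checked numerically by truncating the defining series E[X] = \sum_j j\, P(X=j) (a sketch using scipy.stats with arbitrary parameter values; scipy's geom is supported on \{1,2,\cdots\} as in Definition 2.5):

```python
import numpy as np
from scipy.stats import poisson, geom

lam, p = 2.5, 0.3  # arbitrary illustrative parameters

# Truncate the series far into the (exponentially light) tails
j = np.arange(0, 200)
E_poisson = np.sum(j * poisson.pmf(j, mu=lam))
n = np.arange(1, 200)
E_geom = np.sum(n * geom.pmf(n, p))

assert abs(E_poisson - lam) < 1e-9  # Poisson mean is lambda
assert abs(E_geom - 1 / p) < 1e-9   # geometric mean is 1/p
```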

2.6 Homework problems

Exercise 2.1 Prove Theorem 2.1

Exercise 2.2  

  • Use a Poisson approximation to compute the probability that at least 3 people in a room of N share the same birthday.

  • Revisit the lottery problem of Exercise 1.5. Use a Poisson approximation to estimate the probability that there are exactly k winners.

  • (Optional) Show that for the lottery problem the probability that there are exactly k winners is \frac{1}{k!}\sum_{j=0}^{N-k} \frac{(-1)^j}{j!}.
    Hint: If there are exactly k winners then N-k players are not winners (the probability of that event is known).

Exercise 2.3  

  • Prove that the expected value of a hypergeometric random variable is E[X]=n\frac{b}{r+b}. Hint: Write X=X_1+X_2+ \cdots + X_n where X_i is the number of blue balls in the i^{th} draw (either 0 or 1). Compute first E[X_1] and then argue by symmetry.

  • In the Powerball, at each drawing 5 balls are selected at random among 69 white balls numbered 1 to 69 and 1 ball is selected among 26 red balls numbered 1 to 26. A Powerball ticket costs $2 and consists of selecting 5+1 numbers. You get a prize if the balls selected match the winning balls; see the prizes here.

    • Express the probability of each prize using the hypergeometric distribution.

    • All the prizes are fixed except the jackpot, obtained for 5 correct white balls plus the red powerball. You will observe that most people play only when the jackpot is big enough. As a rational mathematician, determine the minimal jackpot for which it makes sense to buy a Powerball ticket.

Exercise 2.4 Suppose X is a geometric random variable with success parameter p.

  • Show that P(X> m+n|X>n)=P(X>m). What does that mean?

  • Show that E\left[ \frac{1}{X}\right] =\log \left( p^{\frac{p}{p-1}} \right)

  • Show that E\left[ X(X-1)(X-2) \cdots (X-r+1)\right] = \frac{r! (1-p)^{r-1}}{p^r}. Use this to compute E[X^2] and E[X^3].

Exercise 2.5 Suppose X is binomial with parameters (n,p).

  • Find the value j for which P(X=j) is maximal.
    Hint: Compute the ratio P(X=j+1)/P(X=j).
  • Compute the probability that X is even.
    Hint: Compute (p-(1-p))^n.

3 Construction of probability measures

3.1 p-systems and d-systems

Definition 3.1 (p-systems and d-systems)  

  • \mathcal{C} is a p-system if it is closed under (finite) intersections.

  • \mathcal{D} is a d-system if

    1. \Omega\in \mathcal{D}
    2. A,B \in \mathcal{D} and A \supset B \implies A \setminus B \in \mathcal{D}
    3. A_1, A_2, \cdots \in \mathcal{D} \textrm{ with } A_n \nearrow A \implies A \in \mathcal{D}

The p stands for product (= intersection) and the d stands for Eugene Dynkin, who introduced the concept.

It is obvious that a \sigma-algebra is both a p-system and a d-system. The next proposition shows the converse.

Proposition 3.1 \mathcal{E} is a \sigma-algebra if and only if \mathcal{E} is a p-system and a d-system.

Proof. If \mathcal{E} is a p-system and a d-system then \Omega and \emptyset are in \mathcal{E} and \mathcal{E} is closed under complement: all this follows from properties 1. and 2. of d-systems (take A=\Omega in property 2.). Furthermore \mathcal{E} is then closed under finite unions since A \cup B= (A^c \cap B^c)^c. Finally, to extend this to countable unions, define B_n=\cup_{i=1}^n A_i, note that B_n \nearrow \cup_{i=1}^\infty A_i, and use property 3. of d-systems.

3.2 Monotone Class Theorem

The next theorem is a version of many theorems of the same type in probability and measure theory.

Theorem 3.1 (Monotone Class Theorem) If a d-system contains a p-system \mathcal{C} then it contains the \sigma-algebra generated by \mathcal{C}.

Proof. Consider the smallest d-system \mathcal{D} containing \mathcal{C} (intersections of d-systems are d-systems). We claim that \mathcal{D} \supset \sigma(\mathcal{C}). Since \sigma(\mathcal{C}) is the smallest \sigma-algebra containing \mathcal{C}, it is enough to show that \mathcal{D} is a \sigma-algebra itself. By Proposition 3.1 we only need to show that \mathcal{D} is a p-system.

Fix B \in \mathcal{C} and consider \mathcal{D_1}=\{ A \in \mathcal{D}\,:\, A \cap B \in \mathcal{D} \}.

Note that B belongs to \mathcal{D}. We claim that \mathcal{D_1} is a d-system. Clearly \Omega \in \mathcal{D_1}. Further if A_1 \subset A_2 with both A_1,A_2 in \mathcal{D_1} then (A_2 \setminus A_1) \cap B = (A_2 \cap B) \setminus (A_1 \cap B) which belongs to \mathcal{D}. Similarly if A_n \in \mathcal{D_1} and A_n \nearrow A then (A_n \cap B) \nearrow (A \cap B) and so A \cap B \in \mathcal{D} and so A \in \mathcal{D_1}.
\mathcal{D_1} is thus a d-system and it contains \mathcal{C} since B \in \mathcal{C} and \mathcal{C} is a p-system. Therefore \mathcal{D_1} \supset \mathcal{D} and we have shown that if A \in \mathcal{D} and B \in \mathcal{C} then A \cap B \in \mathcal{D}.

We now define for fixed A \in \mathcal{D} the set \mathcal{D_2}=\{ B \in \mathcal{D}\,:\, A \cap B \in \mathcal{D} \}.

One verifies that \mathcal{D_2} is a d-system (just like for \mathcal{D_1}) and, by the previous step, \mathcal{D_2} contains \mathcal{C}; thus \mathcal{D_2} \supset \mathcal{D}. This proves that \mathcal{D} is a p-system. \quad \square

3.3 Uniqueness of Measures

It is impossible to compute P(A) for all sets. An important application of the Monotone Class Theorem is that knowing the values of P on a p-system generating \mathcal{A} determines P uniquely.

Theorem 3.2 (Uniqueness of probability measures) Suppose P and Q are two probability measures on (\Omega,\mathcal{A}). If P(A)=Q(A) for all A in a p-system \mathcal{C} generating \mathcal{A} then P=Q.

Proof. We know that P(A)=Q(A) for all A \in \mathcal{C}. Let us consider \mathcal{D} = \left\{ B \in \mathcal{A}\,:\, P(B)=Q(B)\right\}.

Clearly \mathcal{D} \supset \mathcal{C} so to use the Monotone Class Theorem we need to show that \mathcal{D} is a d-system.

  • Since P(\Omega)=Q(\Omega)=1 then \Omega \in \mathcal{D} and so property 1. holds.

  • For property 2. suppose A,B \in \mathcal{D} with A \supset B; then A \setminus B \in \mathcal{D} since P( A \setminus B) = P(A) - P(B) = Q(A)-Q(B) = Q(A \setminus B)

  • For property 3. if \{A_n\} \subset \mathcal{D} and A_n \nearrow A, then P(A_n)=Q(A_n) for all n and by sequential continuity (see Theorem 1.3) the limits coincide, thus P(A)=Q(A) and so A \in \mathcal{D}.

By the Monotone Class Theorem \mathcal{D} \supset \sigma(\mathcal{C})=\mathcal{A}, that is, P=Q. \quad \square

Corollary 3.1 If two probability measures P and Q on (\mathbb{R},\mathcal{B}) coincide on the sets of the form (-\infty,a] then they are equal.

4 Measurable maps and random variables

4.1 Motivation

  • Given a probability space (\Omega, \mathcal{A}, P) we think of A \in \mathcal{A} as an event and of P(A) as the probability that the event A occurs. Think of this as an “observation”: how likely is it that A occurs?

  • A random variable is a more general kind of observation. Think for example that you are performing some measurement: to an outcome \omega \in \Omega you associate, e.g., a number X(\omega) \in \mathbb{R}. It could also be a vector or some even more general object (e.g. a probability measure!)

  • Consider another state space (F,\mathcal{F}) (often we will take (\mathbb{R}, \mathcal{B}) where \mathcal{B} is the Borel \sigma-algebra) and a map X: \Omega \to F. We will want to compute P(\{\omega, X(\omega) \in A\})= P(X \in A) = P(X^{-1}(A)) for some A \in \mathcal{F}.

  • The notation X^{-1}(A) =\{ \omega\,:\, X(\omega)\in A\} is for the inverse image and for this to make sense we will need X^{-1}(A) \in \mathcal{A}.
    All of this motivates the following definitions.

4.2 Measurable functions and random variables

Given a function f:E \to F and B \subset F we write f^{-1}(B)= \left\{ x \in E \,;\, f(x) \in B \right\} for the inverse image. The following properties are easy to verify:

  • f^{-1}(\emptyset)=\emptyset

  • f^{-1}(A\setminus B) = f^{-1}(A) \setminus f^{-1}(B)

  • f^{-1}(\bigcup_{i} A_i)=\bigcup_{i} f^{-1}\left(A_i\right)

  • f^{-1}(\bigcap_{i} A_i ) = \bigcap_{i} f^{-1}\left(A_i\right)
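These identities are easy to check on finite examples; here is a small Python sanity check (the function f and the sets E, A, B below are arbitrary illustrative choices):

```python
# Sanity check of the inverse-image identities on finite sets.
# The function f and the sets E, A, B are arbitrary illustrative choices.

def preimage(f, domain, B):
    """Inverse image f^{-1}(B) = {x in domain : f(x) in B}."""
    return {x for x in domain if f(x) in B}

E = set(range(-3, 4))        # the domain
f = lambda x: x * x          # f(x) = x^2, a non-injective choice
A, B = {0, 1, 4}, {1, 9}

# f^{-1}(A \ B) = f^{-1}(A) \ f^{-1}(B)
assert preimage(f, E, A - B) == preimage(f, E, A) - preimage(f, E, B)
# f^{-1}(A u B) = f^{-1}(A) u f^{-1}(B)
assert preimage(f, E, A | B) == preimage(f, E, A) | preimage(f, E, B)
# f^{-1}(A n B) = f^{-1}(A) n f^{-1}(B)
assert preimage(f, E, A & B) == preimage(f, E, A) & preimage(f, E, B)
print("all identities hold")
```

Note that the identities hold even though f is not injective; direct images do not enjoy the same properties.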

Definition 4.1 (Measurable and Borel functions) Given measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}), a function f: E \to F is measurable (with respect to \mathcal{E} and \mathcal{F}) if f^{-1}(B) \in \mathcal{E} \textrm{ for all } B \in \mathcal{F} \,. If F=\mathbb{R} (equipped with the Borel \sigma-algebra \mathcal{B}) a measurable function is often called a Borel function.

Definition 4.2 (Random variable) A random variable is a measurable function X: \Omega \to F from a probability space (\Omega,\mathcal{A},P) to some measurable space (F,\mathcal{F}).
Convention: If F=\mathbb{R} then we always take the Borel \sigma-algebra.

Remarks:

  • Using the letter X for a random variable is standard convention from elementary probability.

  • The term “random variable” is maybe a bit unfortunate but it is standard. The word “variable” means we have a function and the word “random” means it is defined on some probability space.

  • Compare this to the definition of continuity. A function is continuous if, for every open set O, f^{-1}(O) is open.

  • We just say measurable if there is no ambiguity on the choice of \mathcal{E} and \mathcal{F}.

Fortunately it is enough to check the condition for a few sets

Proposition 4.1 f: E\to F is measurable with respect to \mathcal{E} and \mathcal{F} if and only if f^{-1}(B) \in \mathcal{E} \quad \textrm{ for all } \quad B \in \mathcal{C} where \mathcal{C} generates \mathcal{F} (i.e. \sigma(\mathcal{C})=\mathcal{F}).

Proof. Consider the family of sets \mathcal{D} = \left\{ B \in \mathcal{F}\,:\, f^{-1}(B) \in \mathcal{E} \right\}. We know that \mathcal{D} \supset \mathcal{C} and that \sigma(\mathcal{C}) = \mathcal{F}.

To conclude it is enough to show that \mathcal{D} is a \sigma-algebra because, if this is true, \mathcal{D} \supset \mathcal{C} implies \mathcal{D} \supset \sigma(\mathcal{C})=\mathcal{F}.

Showing that \mathcal{D} is a \sigma-algebra is easy using the rules for inverse images in Section 4.2.

Corollary 4.1 A function f from (E,\mathcal{E}) to (\mathbb{R},\mathcal{B}) is measurable if and only if f^{-1}((-\infty, a]) = \{x\in E\,:\, f(x) \le a\} \in \mathcal{E} for all a \in \mathbb{R}, that is, the level sets of f need to be measurable sets.

4.3 Operations on measurable functions

Composition of functions f: E \to F, \quad \quad g: F \to G, \quad \quad g\circ f: E \to G

Like continuity, measurability is preserved by composition.

Theorem 4.1 (Composition preserves measurability) If f:E \to F is measurable (w.r.t. \mathcal{E} and \mathcal{F}) and g:F \to G is measurable (w.r.t. \mathcal{F} and \mathcal{G}) then the composition h = g \circ f is measurable (w.r.t \mathcal{E} and \mathcal{G}).

Proof. If C \in \mathcal{G} then (g \circ f)^{-1}(C) = f^{-1}( g^{-1}(C)). By the measurability of g, g^{-1}(C) \in \mathcal{F} and so by the measurability of f, f^{-1}( g^{-1}(C)) \in \mathcal{E}.

Given a function f:E \to \mathbb{R} we define positive/negative parts
f_+ = f \vee 0, \quad f_- = -( f \wedge 0) \quad \implies \quad f=f_+-f_- ,\quad |f|=f_++f_-

Theorem 4.2 f: E \to \mathbb{R} is measurable if and only if f_+ and f_- are measurable.

Proof. It is enough to consider sets of the form \{x, f(x)\le a\}. Proof in your homework.

4.4 Simple functions

Definition 4.3 (Simple functions)  

  • Given a set A \in \mathcal{E}, the indicator function 1_A is defined as 1_A(x) = \left\{ \begin{array}{cl} 1 & \textrm{ if } x \in A \\ 0 & \textrm{ otherwise} \end{array} \right.
  • A simple function f is a function of the form f(x) = \sum_{i=1}^n a_i 1_{A_i}(x) for some finite n, real numbers a_i, and measurable sets A_i.

Remarks

  • The A_i are not necessarily disjoint.

  • A measurable function is simple if and only if it takes finitely many different values.

  • The decomposition is not unique.

Definition 4.4 A simple function is in canonical form if f(x)= \sum_{i=1}^m b_i 1_{B_i}(x) where b_i are all distinct and (B_i)_{i=1}^m form a partition of E.

Remark: One can always rewrite a simple function in canonical form if needed. Just make a list of the values the function takes b_1, b_2, \cdots, b_m and set B_i=\{x, f(x)=b_i\}.
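The recipe in this remark is easy to turn into code; here is a small Python sketch (the representation of f below, with overlapping sets A_i, is an arbitrary illustrative choice):

```python
import numpy as np

# A simple function f = sum_i a_i 1_{A_i} on a finite grid, with overlapping A_i,
# rewritten in canonical form. All choices below are illustrative.
def canonical_form(coeffs, indicators, xs):
    """Return the distinct values b_i of f and the sets B_i = {f = b_i} (a partition)."""
    f = sum(a * ind.astype(float) for a, ind in zip(coeffs, indicators))
    values = np.unique(f)                 # the distinct values b_1 < ... < b_m
    sets = [xs[f == b] for b in values]   # B_i = {x : f(x) = b_i}
    return values, sets

xs = np.arange(6)                         # E = {0, ..., 5}
A1 = np.isin(xs, [0, 1, 2])               # the A_i are not disjoint
A2 = np.isin(xs, [2, 3])
values, sets = canonical_form([1.0, 2.0], [A1, A2], xs)
print(values)                             # distinct values of f = 1_{A1} + 2 * 1_{A2}
print([list(s) for s in sets])            # the sets B_i, which partition E
```

The sets B_i returned are pairwise disjoint and cover E, as required by the canonical form.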


Proposition 4.2 If f and g are simple functions then so are
f+g,\quad f-g, \quad fg, \quad f/g \ (\text{if } g(x) \not=0), \quad f\vee g, \quad f\wedge g

Proof. The simplest way to see this is to note that each of these functions takes at most finitely many values if f and g do, and therefore they must be simple functions.

4.5 Supremum, infimum, limits

As we see next measurability is preserved by basic operations, in particular taking limits.

Refresher on \limsup and \liminf of sequences: Recall the definitions of \liminf and \limsup for sequences of real numbers (they always exist if we allow the values \pm \infty). \liminf_n a_n = \sup_{n} \inf_{m \ge n} a_m = \lim_{n} \inf_{m \ge n} a_m = \textrm{ smallest accumulation point of } \{a_n\} \limsup_n a_n = \inf_{n} \sup_{m \ge n} a_m = \lim_{n} \sup_{m \ge n} a_m = \textrm{ largest accumulation point of } \{a_n\}

\lim_n a_n \textrm{ exists } \iff \liminf_n a_n = \limsup_n a_n
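For a concrete sequence the tail infima and suprema can be computed numerically; here is a small Python illustration with a_n=(-1)^n(1+1/n), whose \liminf is -1 and \limsup is 1 while \lim_n a_n does not exist:

```python
import numpy as np

# a_n = (-1)^n (1 + 1/n): limsup = 1, liminf = -1, and the limit does not exist.
N = 10_000
n = np.arange(1, N + 1)
a = (-1.0) ** n * (1.0 + 1.0 / n)

# tail_inf[k] = inf over the tail a_k, a_{k+1}, ..., a_N (and similarly tail_sup)
tail_inf = np.minimum.accumulate(a[::-1])[::-1]
tail_sup = np.maximum.accumulate(a[::-1])[::-1]

# the tails at a large starting index approximate liminf and limsup
print(tail_inf[N // 2], tail_sup[N // 2])   # close to -1 and 1
```

The sequence tail_inf is increasing and tail_sup is decreasing, which is why the limits defining \liminf and \limsup always exist in [-\infty,\infty].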

We have then

Theorem 4.3 Suppose f_n: E \to \overline{\mathbb{R}}, n=1,2,\cdots is a sequence of measurable functions (with respect to \mathcal{E} and the Borel \sigma-algebra). Then the functions \inf_n f_n, \quad \quad \sup_n f_n\,,\quad \quad \liminf_n f_n\,, \quad \quad \limsup_n f_n, are measurable.
If f=\lim_{n}f_n exists then f is measurable.

Proof.

  • Let us write g=\sup_{n} f_n. It is enough to check that \{g \le a\} is measurable for any a. We have
    \{ g \le a \} =\{ f_n \le a \text{ for all } n\} = \bigcap_{n} \{ f_n \le a\}\,. So \sup_n f_n is measurable if each f_n is measurable.

  • For g=\inf_{n} f_n we could use that the Borel \sigma-algebra is generated by the collection \{ [a,+\infty) \,:\, a \in \mathbb{R}\} and \{ g \ge a \} =\{ f_n \ge a \text{ for all } n\} = \bigcap_{n} \{ f_n \ge a\}\,.

  • Since \limsup and \liminf are written in terms of \inf and \sup they do preserve measurability.

  • If f=\lim_n{f_n} exists then f=\lim_n{f_n} = \limsup_n f_n = \liminf_n f_n and thus is measurable.

4.6 Approximation by simple functions

The following theorem is very important, because it reduces many a computation about measurable functions to a computation about simple functions followed by taking a limit. In that context one also uses constantly that any measurable f is the difference of two nonnegative measurable functions.

Theorem 4.4 (Approximation by simple functions) A nonnegative function f:E\to \mathbb{R}_+ is measurable if and only if f is the limit of an increasing sequence of nonnegative simple functions.

Consider the dyadic staircase functions d_n(x) = \sum_{k=1}^{n2^n} \frac{k-1}{2^n} 1_{ [\frac{k-1}{2^n},\frac{k}{2^n}) }(x) + n 1_{[n,\infty)}(x). Each d_n is a simple, right-continuous function and d_n(x) \nearrow x on [0,\infty).

Proof. It is not difficult to see that the sequence d_n defined above is increasing in n (due to the dyadic decomposition) and that d_n(x) \nearrow x as n \to \infty, since if x \in [\frac{k-1}{2^n},\frac{k}{2^n}) then |x-d_n(x)| \le \frac{1}{2^n}.

Let f be a non-negative measurable function; then the function g_n = d_n \circ f is a measurable function (as a composition of measurable functions) and it is a simple function because d_n \circ f takes only finitely many values. Since d_n is increasing in n and f(x)\ge 0, d_n(f(x)) \nearrow f(x). \quad \square
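The staircase functions d_n from the proof are easy to implement; here is a small Python sketch checking the monotonicity in n and the error bound |x-d_n(x)| \le 2^{-n} on [0,n):

```python
import numpy as np

def d(n, x):
    """Dyadic staircase from the proof: d_n(x) = floor(2^n x)/2^n on [0, n), and n on [n, inf)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= n, float(n), np.floor(x * 2.0**n) / 2.0**n)

x = np.linspace(0.0, 5.0, 1001)
for n in range(1, 10):
    assert np.all(d(n, x) <= d(n + 1, x))                  # d_n is increasing in n
    assert np.all(np.abs(x - d(n, x))[x < n] <= 2.0**-n)   # error bound on [0, n)
print("d_n(x) increases to x")
```

The closed form floor(2^n x)/2^n on [0,n) is just a compact rewriting of the sum of indicators in the statement.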

Corollary 4.2 (Approximation by simple functions) A function f:E\to \mathbb{R} is measurable if and only if it can be written as the limit of a sequence of simple functions.

Proof. Write f=f_+-f_- and apply Theorem 4.4 to f_\pm. \quad \square

Theorem 4.5 Suppose f and g are measurable then f+g, \quad f-g, \quad fg, \quad f/g ( \text{ if } g(x) \not=0 ) are measurable

Proof. Homework.

4.7 Extended real-valued functions

  • Write \overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}.

  • Often it is useful to consider functions which are allowed to take the values \pm \infty.

  • The Borel \sigma-algebra on \overline{\mathbb{R}} consists of all sets of the form A, A \cup \{-\infty\}, A \cup \{\infty\}, A \cup \{-\infty, \infty\} with A \in \mathcal{B}(\mathbb{R}).

  • This Borel \sigma-algebra is generated by the intervals of the form [-\infty, r], r \in \mathbb{R}.

  • All properties of measurable functions f: E \to \mathbb{R} extend to functions f: E \to \overline{\mathbb{R}}: approximation by simple functions, supremum, infimum, etc…

  • We will use all this whenever we need it.

4.8 Homework problems

Exercise 4.1 Show that f is measurable if and only if f_+ and f_- are measurable.

Exercise 4.2 Suppose f: \mathbb{R} \to \mathbb{R} (both equipped with Borel \sigma algebra) is continuous. Show that f is measurable.
Remark: This also holds for any continuous function between arbitrary topological spaces.

Exercise 4.3  

  • Suppose f: \mathbb{R} \to \mathbb{R} (both equipped with the Borel \sigma-algebra) is a right-continuous step function: there exists a (finite or countable) collection of intervals I_n=[t_n, s_n) such that f is constant on each I_n and \cup_n I_n = \mathbb{R}. Show that such a function is measurable.
  • A function f: \mathbb{R} \to \mathbb{R} is right-continuous if, for every x, f(x_n)\to f(x) for any decreasing sequence x_n \searrow x. Show that such a function is measurable.
    Hint: Set c_n = \sum_{k=1}^{\infty} \frac{k}{2^n} 1_{ [\frac{k-1}{2^n},\frac{k}{2^n})} and f_n =f \circ c_n.

Exercise 4.4 Suppose f: \mathbb{R} \to \mathbb{R} is increasing. Show that f is measurable.

Exercise 4.5 Given two measurable function f,g from (E,\mathcal{E}) to (\mathbb{R}, \mathcal{B}). Show that the sets \{f \le g \}, \quad \{f < g \}, \quad \{f = g \},\quad \{f\not =g\} are all measurable.

Exercise 4.6 Suppose (E, \mathcal{E}) and (F,\mathcal{F}) are two measurable spaces. A (measurable) rectangle in E\times F is a set of the form A \times B with A \in \mathcal{E}, B \in \mathcal{F}. The product \sigma-algebra \mathcal{E} \otimes \mathcal{F} is defined as the \sigma-algebra generated by all measurable rectangles.

  • Suppose f: E \to F is measurable (with respect to \mathcal{E} and \mathcal{F}) and g: E \to G is measurable (with respect to \mathcal{E} and \mathcal{G}). Show that the function h: E \to F\times G given by h(x)=(f(x),g(x)) is measurable (with respect to \mathcal{E} and \mathcal{F}\otimes \mathcal{G}).

  • Suppose f: E \times F \to G is measurable (with respect to \mathcal{E}\otimes \mathcal{F} and \mathcal{G}). For any fixed x_0\in E define the section of f as the function h: F \to G \quad \text{ with } h(y)=f(x_0,y) Show that h is measurable.
    Hint: Show first that the map g: F \to E\times F given by g(y)=(x_0,y) is measurable.

5 Distribution functions and quantile functions

5.1 Random variables

Let us apply what we have learned in the last sections to random variables X : \Omega \to \mathbb{R} where (\Omega, \mathcal{A}, P) is a probability space.

Theorem 5.1 Suppose f: E \to F is a measurable map between the measurable spaces (E, \mathcal{E}) and (F,\mathcal{F}) and P a probability measure on (E, \mathcal{E}). Then

  1. f^{-1}(F)= \left\{ f^{-1}(B), B \in \mathcal{F} \right\} is a \sigma-algebra, in general a sub \sigma-algebra of \mathcal{E}.

  2. P\circ f^{-1}(B), which is defined as P\circ f^{-1}(B) = P( f^{-1}(B)) = P(\{x\,:\, f(x) \in B \}), is a probability measure on (F,\mathcal{F}).

Proof. Check the axioms.

Definition 5.1 (Image of a measure) The measure P \circ f^{-1} is called the image of the measure P under f. Various other notations are used (such as f_\#P, etc…)

Adding some terminology

Definition 5.2 (The \sigma-algebra generated by a random variable X) Given a random variable X : \Omega \to \mathbb{R} defined on the probability space (\Omega, \mathcal{A}, P), the \sigma-algebra generated by a random variable X is the \sigma-algebra X^{-1}(\mathcal{B}) \subset \mathcal{A}.

The interpretation is that this \sigma-algebra contains all the “information” you can extract from the probability measure P simply by using the random variable X. This will play an increasingly important role in the future!


Definition 5.3 (Distribution of a random variable X) Given a random variable X : \Omega \to \mathbb{R} defined on the probability space (\Omega, \mathcal{A}, P), the distribution of the random variable X is the probability measure P^X given by P^X \equiv P\circ X^{-1} defined on (\mathbb{R}, \mathcal{B}). That is we have
P^X(B) = P(X \in B).


5.2 Cumulative distribution function

By Corollary 3.1, probability measures on \mathbb{R} are uniquely determined by their values on the intervals (-\infty,t]; this justifies the following definition.

Definition 5.4 (Cumulative distribution function) The cumulative distribution function (CDF) of a random variable X is the function F_X: \mathbb{R} \to [0,1] defined by F_X(t) = P\{ X \le t\} = P^X((-\infty,t])

Theorem 5.2 (Properties of CDF) If the function F(t) is the CDF for some random variable X, then F has the following properties

  1. F is increasing.

  2. \lim_{t \to -\infty} F(t)=0 and \lim_{t \to +\infty} F(t)=1

  3. F is right-continuous: for every t, F(t)= F(t+)\equiv \lim_{s \searrow t} F(s).

Proof. Item 1. is the monotonicity property for the probability measure P^X. Item 2. follows from sequential continuity and from the fact that (-\infty,t] \searrow \emptyset as t \searrow -\infty and so F(t) \searrow P^X(\emptyset) =0. A similar argument works for t \nearrow \infty. Item 3. follows also from sequential continuity since as s \searrow t, (-\infty,s] \searrow (-\infty,t].

Remarks:

  • Note that F is in general not left-continuous. Indeed if s \nearrow t then (-\infty, s] \nearrow (-\infty,t) and P^X( (-\infty,t] )=P^X( (-\infty,t) ) + P^X( \{t\}). We denote the left limit by F(t-).

  • One can compute probabilities using the CDF. For example

    • P( a < X \le b ) = F(b)-F(a)

    • P( a \le X \le b ) = F(b) - F(a-)

    • P( X=b ) = F(b)-F(b-)

  • An atom for a probability measure P on a set \Omega is an element \omega \in \Omega such that P(\{\omega\})>0.

  • The distribution P^X of the random variable X has an atom at t exactly when the CDF is discontinuous at t (i.e. F_X(t-)\not=F_X(t)).

  • The distribution P^X of the random variable X has at most countably many atoms. (Why? see homework)

  • A discrete random variable X taking values \cdots < \omega_{-1} < \omega_0 < \omega_1 < \cdots has a purely atomic distribution P^X.
    The CDF is piecewise constant and we have F_X(t)= \sum_{n \,:\, \omega_n \le t } P(\{\omega_n\})
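These formulas are easy to check numerically for a discrete random variable; here is a small Python sketch with a fair die (an arbitrary illustrative choice):

```python
import numpy as np

# A fair die: atoms at 1, ..., 6 with mass 1/6 each (an illustrative choice).
atoms = np.arange(1, 7)
masses = np.full(6, 1 / 6)

def F(t):
    """CDF F(t): total mass of the atoms <= t (piecewise constant, right-continuous)."""
    return masses[atoms <= t].sum()

def F_minus(t):
    """Left limit F(t-): total mass of the atoms < t."""
    return masses[atoms < t].sum()

assert abs((F(4) - F(2)) - 2 / 6) < 1e-12        # P(2 < X <= 4) = P({3, 4})
assert abs((F(4) - F_minus(2)) - 3 / 6) < 1e-12  # P(2 <= X <= 4) = P({2, 3, 4})
assert abs((F(3) - F_minus(3)) - 1 / 6) < 1e-12  # P(X = 3): the atom at 3
print("CDF formulas check out")
```
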

Another way to define a CDF is to use a PDF (=probability density function).

Definition 5.5 (Probability density function) A probability density function (PDF) is a function f: \mathbb{R} \to \mathbb{R} such that

  • f(t) \ge 0, f is non-negative

  • \int_{-\infty}^\infty f(t) dt = 1, f is normalized

The corresponding CDF is then given by the integral F(t)= \int_{-\infty}^t f(x) dx

For now think of the integral as a Riemann integral (e.g. f is piecewise continuous). We will revisit this later when equipped with better integration tools. Many of the classical distributions in probability are given by densities. Here are some examples which will come back.

Examples of PDF:

  1. Uniform RV on [a,b]: This random variable takes values uniformly distributed in the interval [a,b]. f(x)= \left\{ \begin{array}{cl} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{array} \right. \quad \quad F(x)= \left\{ \begin{array}{cl} 0 & x \le a \\ \frac{x-a}{b-a} & a \le x \le b \\ 1 & x \ge b \end{array} \right.

  2. Exponential RV with parameter \beta:

f(x)= \left\{ \begin{array}{cl} 0 & x \le 0 \\ \beta e^{-\beta x} & x \ge 0 \end{array} \right. \quad \quad F(x)= \left\{ \begin{array}{cl} 0 & x \le 0 \\ 1 - e^{-\beta x} & x \ge 0 \\ \end{array} \right.

  3. Gamma RV with parameters (\alpha,\beta):

  4. Weibull distribution with parameters (\alpha,\beta):

  5. Normal distribution with parameters (\mu, \sigma^2): The normal distribution has parameters \mu \in \mathbb{R} and \sigma^2>0. f(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \quad \quad F(x)=\int_{-\infty}^x f(t) \, dt

  6. Log-normal distribution with parameters (\mu, \sigma^2):

  7. Laplace distribution with parameters (\alpha,\beta): This is a 2-sided and shifted version of the exponential distribution. f(x)= \frac{\beta}{2} e^{-\beta|x-\alpha|}

  8. Cauchy distribution with parameters (\alpha,\beta): This is an example of a distribution without a finite mean. f(x)= \frac{1}{\beta \pi} \frac{1}{1+(x-\alpha)^2/\beta^2} \quad \quad F(x) =\frac{1}{\pi} \arctan\left( \frac{x-\alpha}{\beta}\right) + \frac{1}{2}

  9. Pareto distribution with parameters (x_0,\alpha):
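Such densities can be checked numerically; here is a small Python sketch verifying, for the exponential distribution with an arbitrary choice of \beta, that f is normalized and that the stated CDF matches the integral of f:

```python
import numpy as np

# Exponential density f(x) = beta * exp(-beta x) on [0, inf); beta = 2 is arbitrary.
beta = 2.0
x = np.linspace(0.0, 40.0, 400_001)
dx = x[1] - x[0]
f = beta * np.exp(-beta * x)

total = f.sum() * dx                    # Riemann sum for the integral of f over [0, 40]
F3_numeric = f[x <= 3.0].sum() * dx     # Riemann sum for the integral over [0, 3]
F3_formula = 1.0 - np.exp(-beta * 3.0)  # the stated CDF F(3) = 1 - e^{-3 beta}

assert abs(total - 1.0) < 1e-3          # f is (numerically) normalized
assert abs(F3_numeric - F3_formula) < 1e-3
print(total, F3_numeric, F3_formula)
```

Truncating the integral at 40 is harmless here since the tail e^{-\beta \cdot 40} is negligible.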

5.3 Random variables with mixed distribution.

It is easy to build a random variable whose distribution is neither discrete nor continuous.

Example: Flip a fair coin. If the coin lands on heads you win a prize X uniformly distributed on [0,1] and if the coin lands on tails you lose (X=0). Then X has an atom at 0 and F(x)= \left\{ \begin{array}{cl} 0 & x < 0 \\ \frac{1}{2} + \frac{1}{2} x & 0 \le x < 1 \\ 1 & x \ge 1 \end{array} \right.

Definition 5.6 (Mixtures of Random variables) Suppose X_1, X_2, \cdots, X_m are random variables with CDFs F_{X_i}(t) and \alpha=(\alpha_1, \cdots, \alpha_m) is such that \alpha_i \ge 0 and \sum_{i=1}^m \alpha_i =1. Then \sum_{i=1}^m \alpha_i F_{X_i}(t) is a CDF of a random variable X which is called the (\alpha_1, \cdots, \alpha_m) mixture of X_1, X_2, \cdots, X_m.

In the previous example we had a (1/2,1/2) mixture of X_1=0 (a discrete RV) and X_2 a uniform RV on [0,1] (a continuous RV).

5.4 Devil’s staircase

We construct here a CDF with remarkable properties

  • F(t) has no discontinuities (no atoms)

  • F(t) does not have a density, that is F(t) cannot be written as F(t)=\int_{0}^{t} f(s)\, ds.

The construction is based on the Cantor set and F is defined iteratively.

  • Set F_0(t)=t

  • Define the function F_1 to be equal to \frac{1}{2} on [1/3, 2/3], continuous and piecewise linear on [0,1] with F_1(0)=0 and F_1(1)=1. Then we have |F_1(t)-F_0(t)|< \frac{1}{2}.

  • In the second step, let F_2 be equal to \frac{1}{4} on [1/9, 2/9], unchanged on [1/3, 2/3], equal to \frac{3}{4} on [7/9, 8/9], continuous and piecewise linear on [0,1] with F_2(0)=0 and F_2(1)=1. We have |F_2(t)-F_1(t)|< \frac{1}{4}.

  • Repeat the procedure, now on the intervals [1/27, 2/27], [7/27, 8/27], [19/27, 20/27], [25/27, 26/27], and so on.

  • It is not difficult to see, by induction, that |F_{n}(t)-F_{n-1}(t)| \le \frac{1}{2^{n}} and thus the sequence F_n converges uniformly to a continuous function F(t) which is increasing on [0,1].

(Figures: the functions F_0, F_1, F_2, F_3; the iterative construction; the limiting function F(t).)

The function F(t) is a CDF in good standing. We have P([1/3,2/3])=0 as well as P([1/9,2/9])=P([7/9,8/9])=0 and so on. In particular at stage n there are 2^{n-1} intervals of length 3^{-n} whose probability vanishes. The total length of all the intervals on which the probability vanishes is \frac{1}{3} + 2 \times \frac{1}{9} + 4 \times \frac{1}{27} + \cdots =\sum_{n=1}^\infty \frac{2^{n-1}}{3^n} = 1.
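The iteration can also be written through the self-similar recursion F_{n+1}(t)=\frac{1}{2}F_n(3t) on [0,1/3], F_{n+1}=\frac{1}{2} on [1/3,2/3], F_{n+1}(t)=\frac{1}{2}+\frac{1}{2}F_n(3t-2) on [2/3,1], which is equivalent to the construction above. A small Python sketch printing the successive uniform distances:

```python
import numpy as np

def cantor_iterate(Fn, x):
    """One step of the construction via the self-similar recursion:
    F_{n+1}(t) = F_n(3t)/2 on [0,1/3], 1/2 on [1/3,2/3], 1/2 + F_n(3t-2)/2 on [2/3,1].
    Fn holds the values of F_n on the grid x; interpolation fills the gaps."""
    F = lambda t: np.interp(t, x, Fn)
    y = np.full_like(x, 0.5)
    left, right = x < 1 / 3, x > 2 / 3
    y[left] = 0.5 * F(3 * x[left])
    y[right] = 0.5 + 0.5 * F(3 * x[right] - 2)
    return y

x = np.linspace(0.0, 1.0, 3**7 + 1)   # a triadic grid
Fn = x.copy()                         # F_0(t) = t
diffs = []
for n in range(7):
    Fnext = cantor_iterate(Fn, x)
    diffs.append(np.max(np.abs(Fnext - Fn)))
    Fn = Fnext
print(diffs)                          # successive uniform distances, halving at each step
```

The printed distances halve at each step, in line with the bound |F_n - F_{n-1}| \le 2^{-n} used above.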

A random variable X with CDF F(t) is neither continuous (in the sense of having a density), nor discrete.

5.5 Quantile functions

Intuitively a p-quantile for a RV X, where p \in (0,1), is a value t \in \mathbb{R} where F_X(t)=P(X\le t) reaches (or crosses over) p. For p=\frac{1}{2} it is usually referred to as the median. More formally:

Definition 5.7 (Quantiles of a RV X.) For p \in (0,1), a p-quantile for the RV X is a value t\in \mathbb{R} such that P(X < t)=F_X(t-) \le p \quad \text { and } \quad P(X\le t) =F_X(t) \ge p

Remark: Various cases are possible

  • a is the unique p-quantile (F_X is strictly increasing at a).

  • b is the unique q-quantile (but a whole interval of values q shares the same quantile b!).

  • All points of the interval [c,d] are r-quantiles (because F_X is locally constant there).

We now make a choice to make it unique (other conventions occur in the literature).

Definition 5.8 (Quantile function for a random variable X) For a RV X with CDF F(t) we define the quantile function of X, Q:[0,1] \to \mathbb{R} as Q(p) = \min\{ t\,:\, F(t) \ge p \} with the convention that \min \emptyset = + \infty.


Remark:

  • Q is well defined since F being increasing and right-continuous implies that \{t\,:\,F(t)\ge p\}=[a,\infty) for some a,
    and thus the minimum exists.

  • Q(p) is a p-quantile since if s=Q(p) then F(s)\ge p and, for any t< s, F(t)<p. Therefore F(s-)\le p. In fact this shows that Q(p) is the smallest p-quantile of X.

  • If we had picked \widetilde{Q}(p)=\inf \{ t\,:\, F(t) > p \} this would have given us the largest p-quantile (a fine choice as well).
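The convention Q(p)=\min\{t : F(t)\ge p\} is straightforward to implement for a discrete distribution; here is a small Python sketch (an arbitrary illustrative distribution with dyadic masses, chosen so the CDF values are exact floats), which also checks the equivalence Q(p)\le t \iff p \le F(t) of Theorem 5.4 at the atoms:

```python
import numpy as np

# A discrete distribution with dyadic masses 1/4, 1/2, 1/4 at the atoms 0, 1, 2
# (chosen so that the CDF values are exact floating point numbers).
atoms = np.array([0.0, 1.0, 2.0])
cdf = np.array([0.25, 0.75, 1.0])     # F evaluated at the atoms

def Q(p):
    """Quantile Q(p) = min{t : F(t) >= p}: the first atom where the CDF reaches p."""
    return atoms[np.searchsorted(cdf, p, side='left')]

print(Q(0.25), Q(0.26))               # Q jumps where F is flat
print(Q(0.5), Q(0.75))                # Q is flat where F jumps

# Check the equivalence Q(p) <= t  <=>  p <= F(t) at the atoms
for p in np.linspace(0.01, 1.0, 100):
    for t, Ft in zip(atoms, cdf):
        assert (Q(p) <= t) == (p <= Ft)
print("Galois equivalence verified")
```

searchsorted with side='left' returns exactly the first index where the (increasing) CDF reaches p, which matches the minimum in the definition.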

Theorem 5.3 (Properties of the quantile function) The quantile function Q(p) satisfies the following properties

  1. Q(p) is increasing.

  2. Q(F(t)) \le t.

  3. F(Q(p)) \ge p.

  4. Q(p-)=Q(p) and Q(p+) exists. That is Q is left continuous.

Proof.

  1. If p \le q then F increasing implies that \{t\,:\, F(t) \ge q \} \subset \{t\,:\, F(t) \ge p \} and this implies that Q(q)\ge Q(p).

  2. By definition Q(F(t)) is the smallest s such that F(s)\ge F(t). Thus Q(F(t))\le t.

  3. Q(p) is a value s such that F(s)\ge p and thus F(Q(p))\ge p.

  4. This holds because F is right-continuous.

Flat portions of F(t) become jumps of Q(p) and vice-versa.

5.6 Application: Construction of probability measures on \mathbb{R}

The most important property of the quantile function is the following, which shows that Q is a form of functional inverse of F.

Theorem 5.4 We have Q(p) \le t \iff p \le F(t)

Proof.

  • If Q(p) \le t then since F is increasing F(Q(p)) \le F(t). But by Theorem 5.3, item 3., F(Q(p))\ge p and thus p \le F(t).

  • Conversely if p \le F(t) then, since Q is increasing, Q(p) \le Q(F(t)) \le t where the last inequality is from Theorem 5.3, item 2.

\quad \square.

We turn next to constructing all probabilities on \mathbb{R}. To do this we first need to construct at least one.

Theorem 5.5 (Lebesgue measure on [0,1]) There exists a unique probability measure P_0 on [0,1] with its Borel \sigma-algebra such that P_0([a,b])=b-a for 0 \le a \le b \le 1. The measure P_0 is the distribution of the uniform random variable on [0,1] with PDF f(x)= \left\{ \begin{array}{cl} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{array} \right. and CDF F(x) =\left\{ \begin{array}{cl} 0 & x\le 0 \\ x & 0 \le x \le 1 \\ 1 & x \ge 1 \end{array} \right.

Proof. Go and take Math 623….

Equipped with this we can now prove

Theorem 5.6 Any probability measure P on \mathbb{R} has the form P= P_0 \circ Q^{-1} where P_0 is the Lebesgue measure on [0,1] and Q is the quantile function for F(t)=P((-\infty,t]).

Proof. By definition of the image measure (see Theorem 5.1), P_0 \circ Q^{-1} is a probability measure, and using Theorem 5.4 together with the fact that P_0( [0,a])=a we get \begin{aligned} P_0 \circ Q^{-1}( (-\infty, t]) & = P_0( \{p\,:\, Q(p) \le t\} ) \\ & = P_0( \{p\,:\, p \le F(t) \}) \\ & = F(t) \end{aligned} and we are done since the CDF determines the measure P uniquely. \quad \square

Another way to interpret this result is that we have constructed a probability space for any RV with a given CDF. Namely we take the probability space (\Omega,\mathcal{A},P)=([0,1],\mathcal{B},P_0), where P_0 is the Lebesgue measure on [0,1], together with the map X=Q (the quantile function), Q:[0,1] \to \mathbb{R}.

5.7 Simulation

Computers have built-in random number generators which generate a uniform RV on [0,1], that is, a RV whose distribution is P_0.

Inverse method to generate Random Variables:

To generate a RV X with CDF F(t):

  • Generate a random number U, uniform on [0,1].

  • Set X=Q(U) where Q is the quantile function for X.

Example:

  • If X has an exponential distribution, then F(t)= \int_0^t \lambda e^{-\lambda s} ds = 1-e^{-\lambda t} and Q(p)= -\frac{1}{\lambda} \ln(1 -p)

  • If X is uniform on \{1,2,\cdots,n\} then the quantile function is the function Q(p)=\lceil np\rceil. Recall \lceil x\rceil is the smallest integer equal or greater than x.

  • If X is a normal RV then the CDF is F(t)=\int_{-\infty}^t \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx. The quantile Q=F^{-1} has no closed form, but there exist excellent numerical routines to compute it. This can be used to generate normal random variables.

The inverse method has its limitations and we will learn other simulation methods later on.

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import ndtri # quantile for the normal RV

uniform = np.random.rand(1000000)  # generate random numbers
dataexponential = - np.log(1-uniform) # quantile for the exponential
datanormal =  ndtri(uniform)   # quantile for the normal 
datadiscreteuniform10 = np.ceil(10*uniform) # quantile for uniform on {1,...,10}

# Create a histogram
hist, bin_edges = np.histogram(datadiscreteuniform10, bins=100, density=True)  
# Adjust the number of bins as needed

# Convert the histogram (a density) into per-bin probabilities
bin_width = bin_edges[1] - bin_edges[0]
pdf = hist * bin_width

# Plot the empirical PDF
plt.bar(bin_edges[:-1], pdf, width=bin_width, alpha=0.5)
plt.xlabel('X-axis')
plt.ylabel('PDF')
plt.title('Empirical Probability Density Function')
plt.show()

5.8 Exercises

Exercise 5.1  

  1. Suppose Y is a real-valued random variable with a continuous cdf F_Y(t) and probability distribution P_Y on (\mathbb{R},\mathcal{B}). Show that the random variable U=F_Y(Y) has a uniform distribution P_0 on [0,1] (i.e. Lebesgue measure).

  2. In Theorem 5.6, using the quantile Q, given a CDF F we constructed a random variable X:([0,1],\mathcal{A},P_0) \to (\mathbb{R},\mathcal{B}) (P_0 is Lebesgue measure) whose CDF is F. In other words we showed Q(U) has CDF F.
    Use this fact and part 1. to construct a random variable X':([0,1],\mathcal{B},P_Y) \to (\mathbb{R},\mathcal{B}) whose CDF is F.

Exercise 5.2 Show that the function X(\omega)= \left\lceil \frac{\ln(1-\omega)}{\ln(1-p)} \right\rceil defines a geometric random variable with success probability p on the probability space (\Omega, \mathcal{A}, P_0) (where P_0 is the Lebesgue measure). In other words, if U is uniform on [0,1] then \left\lceil \frac{\ln(1-U)}{\ln(1-p)} \right\rceil has a geometric distribution, which provides an easy way to generate geometric random variables on a computer.
Hint: What are the CDFs of exponential and geometric random variables?

Exercise 5.3 Some notations:

  • A probability measure P on the measurable space (\Omega, \mathcal{A}) is called diffuse if P has no atoms.
  • Two probability measures P and Q on (\Omega, \mathcal{A}) are called singular if we can partition \Omega=\Omega_P\cup \Omega_Q (with \Omega_P\cap \Omega_Q =\emptyset) such that P(\Omega_P)=1 and Q(\Omega_Q)=1.
  • The set of all probability measures on (\Omega, \mathcal{A}) is denoted by \mathcal{P}(\Omega). It is a convex set: if P,Q \in \mathcal{P}(\Omega) then R=\alpha P + (1-\alpha)Q \in \mathcal{P}(\Omega) for any \alpha \in [0,1]. We say then that R is a mixture of P and Q.

Show the following

  1. Show that any probability measure can be decomposed as a mixture of a purely atomic measure and a diffuse measure.

  2. Suppose P is a probability measure on (\mathbb{R},\mathcal{B}) with CDF F(t). Describe the decomposition of the measure P into an atomic and diffuse measure in terms of the CDF F.

  3. Suppose P is a diffuse measure on (\mathbb{R},\mathcal{B}) and A \subset \mathbb{R} is any subset with P(A)>0. Show that for any 0\le t \le 1 there exists a set B_t \subset A such that P(B_t)=tP(A).
    Hint: Let B_t=A \cap(-\infty,t]. Study the function h(t)=P(B_t).

Exercise 5.4  

  • Prove that the Cantor function (a.k.a. devil’s staircase) given in Section 5.4 is continuous and that it defines a diffuse probability measure P.
  • Let C be the Cantor set obtained by removing from [0,1] the intervals (1/3,2/3) and (1/9, 2/9) (7/9,8/9) and so on. If P_0 is the Lebesgue measure on [0,1], show that P_0(C)=0 and that yet C has the same cardinality as [0,1]. Hint: One option is to use the Cantor function.
  • Show that the Lebesgue measure on [0,1], the Cantor measure, and any atomic measure are mutually singular.

Exercise 5.5 In this problem I would like you to write at least a pseudo-code (better: write a code, run it, for example by modifying the code in Section 5.7, and visualize your result). We suppose the quantile function of the normal random variable with parameters (\mu,\sigma)=(0,1) is known. For example in python

from scipy.special import ndtri # quantile for the normal RV

Calling random numbers (as many as needed) and using the quantile function ndtri write a code which generates a mixture of 3 normal random variables with parameters ( \mu_1, \sigma_1) = (-2,.4),\quad (\mu_2, \sigma_2)=(0,.3),\quad (\mu_3, \sigma_3)=(3,1) with mixing parameters (2/7,4/7,1/7).

6 Integration with respect to a probability measure

Given a probability space (\Omega, \mathcal{A}, P) and a random variable X: \Omega \to \mathbb{R} how do we define the expectation of X for general random variables?

There are two parts to the theory: a general theory using the measure P, from which we deduce a more practical method using the distribution P^X on \mathbb{R} (the only thing we really know how to handle …)

6.1 Definition of the expectation (a.k.a the integral)

We start by giving a definition of the expectation for an arbitrary random variable. The definition is a bit rigid and may seem at first sight slightly arbitrary, but subsequent analysis will show that this is a good choice.

Definition 6.1 (Definition of expectation) Let (\Omega, \mathcal{A},P) be a probability space.

  1. Suppose X is a simple RV (i.e., it takes finitely many values) then X=\sum_{j=1}^M b_j 1_{B_j} (in canonical form!). We define
    E[X]= \sum_{j=1}^M b_j P(B_j) \tag{6.1}

  2. Suppose X is an arbitrary non-negative RV (i.e. X(\omega)\ge 0 for all \omega \in \Omega.) Then using the functions d_n given in Theorem 4.4 consider the simple RV X_n=d_n \circ X and define E[X]= \lim_{n \to \infty} E[X_n] \quad \text{where the limit is allowed to be } +\infty \tag{6.2}

  3. For an arbitrary RV X, write X=X_+-X_- and define E[X]= \left\{ \begin{array}{cl} E[X_+] - E[X_-] & \text{if } E[X_+] < \infty \text{ or } E[X_-] < \infty\\ \text{undefined} & \text{if } E[X_+] = \infty \text{ and } E[X_-] = \infty \end{array} \right. \tag{6.3}

Remarks Let us make a number of comments on the definition.

  1. If the simple RV is not in canonical form, i.e. X=\sum_{i=1}^N a_i 1_{A_i}, then still E[X]=\sum_{i=1}^N a_i P(A_i).

  2. The preceding remark implies that if X and Y are simple random variables then E[X +Y]=E[X]+ E[Y] (linearity of expectation).

  3. If X and Y are simple and nonnegative and X \le Y then E[X]\le E[Y]. This follows from the linearity by writing Y= X + (Y-X) and so E[Y] =E[X] + E[Y-X]. Since Y-X \ge 0 we have E[Y-X]\ge 0 and so E[X]\le E[Y].

  4. The functions d_n are increasing in n, d_n(x) \le d_{n+1}(x), and this implies that X_n \le X_{n+1} and thus E[X_n] \le E[X_{n+1}].
    Therefore the limit in Equation 6.2 always exists but could well be equal to +\infty.

  5. The definition in item 2. seems somewhat arbitrary since it is using a particular choice of simple functions d_n. We will show soon that this choice actually does not matter.

  6. For general X we allow the expectation to equal +\infty (if E[X_+]=\infty and E[X_-]<\infty) or -\infty (if E[X_+]<\infty and E[X_-]=\infty). If both E[X_+]=\infty and E[X_-]=\infty the expectation is undefined.

  7. If X:\Omega \to \overline{\mathbb{R}} is extended real-valued we can still define the expectation in the same way.
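To illustrate item 2., here is a sketch computing E[X_n]=E[d_n \circ X] exactly for X(\omega)=\omega on \Omega=[0,1] with the Lebesgue measure, assuming the standard dyadic choice d_n(x)=\min(\lfloor 2^n x\rfloor/2^n,\, n) for the functions of Theorem 4.4 (that theorem is not reproduced here, so this choice is an assumption):

```python
from fractions import Fraction

def dyadic_expectation(n):
    """E[d_n ∘ X] for X(ω)=ω on Ω=[0,1] with the Lebesgue measure,
    assuming d_n(x) = floor(2^n x)/2^n capped at n (the cap is inactive since X ≤ 1).
    X_n takes the value k/2^n on [k/2^n, (k+1)/2^n), an interval of probability 2^(-n)."""
    N = 2 ** n
    return sum(Fraction(k, N) * Fraction(1, N) for k in range(N))

vals = [dyadic_expectation(n) for n in range(1, 11)]
# an increasing sequence of approximations converging to E[X] = 1/2
print([float(v) for v in vals])
```

In closed form E[X_n]=(2^n-1)/2^{n+1}, which increases to 1/2 as the definition requires.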

Definition 6.2 A random variable X is integrable if E[|X|] < \infty, or equivalently if both E[X_+]<\infty and E[X_-]<\infty; in that case E[X] is finite.

6.2 Monotone Convergence

We extend monotonicity to general non-negative RVs.

Theorem 6.1 (Monotonicity) If X \ge 0 then E[X] \ge 0. If 0 \le X \le Y then E[X]\le E[Y].

Proof. If X \ge 0 so is X_n=d_n\circ X and therefore E[X]\ge 0. If 0 \le X \le Y then X_n \le Y_n and so E[X_n]\le E[Y_n] and thus E[X]\le E[Y].

The next theorem (Monotone convergence Theorem) is very useful in itself and, in addition, the other convergence theorems for expectations derive from it.

Theorem 6.2 (Monotone Convergence Theorem) Suppose X_n are non-negative and increasing: 0\le X_n(\omega) \le X_{n+1}(\omega). Then X(\omega)=\lim_{n \to \infty}X_n(\omega) exists and
\lim_{n \to \infty} E[X_n] = E[X]= E[\lim_{n\to \infty} X_n ]

Proof. Since X_n(\omega) is an increasing sequence, the limit X(\omega) \in \overline{\mathbb{R}} exists and E[X] exists. By monotonicity, see Theorem 6.1, we have X_n \le X_{n+1} \le X and therefore \lim_{n \to \infty} E[X_n] exists and we have \lim_{n \to \infty} E[X_n] \le E[X]\,. We need to show the reverse inequality: \lim_{n \to \infty} E[X_n] \ge E[X]. To prove this we need to show the following claim.

Claim: Suppose Y is simple and Y \le X then \lim_{n \to \infty} E[X_n] \ge E[Y].

Indeed if the claim is true \lim_{n \to \infty} E[X_n] \ge E[d_k \circ X ] for all k and taking the limit k \to \infty concludes the proof.

To prove the claim take b \ge 0 and consider the set B=\{ X > b\} and set B_n =\{ X_n >b \}. Since B_n \nearrow B we have P(B_n) \to P(B) by sequential continuity. Furthermore X_n 1_{B} \ge X_n 1_{B_n} \ge b 1_{B_n} which implies, by monotonicity, that E[X_n 1_{B}] \ge b P(B_n) and taking n \to \infty we obtain \lim_{n \to \infty} E[X_n 1_B]\ge b P(B). \tag{6.4} Now this inequality remains true if we consider the set \overline{B}=\{ X \ge b\} instead of B. To see this, take an increasing sequence b_m \nearrow b so that \{X >b_m\} \nearrow \{X \ge b\}. Indeed apply Equation 6.4 (with b replaced by b_m) and then use monotonicity.

To conclude note that if Y=\sum_{i=1}^m a_i 1_{A_i} (in canonical form) and X\ge Y then X\ge a_i on A_i. By finite additivity, using Equation 6.4, we have \lim_{n \to \infty} E[X_n] = \sum_{i=1}^m \lim_{n \to \infty} E[X_n 1_{A_i}] \ge \sum_{i=1}^m a_i P(A_i) = E[Y] and this concludes the proof. \quad \square

6.3 Further properties of the expectation

Remark: The monotone convergence theorem shows that if X_n is any sequence of simple functions increasing to X then E[X]=\lim_n E[X_n].

Theorem 6.3 (Linearity of Expectation) If X and Y are integrable random variables then for any a, b \in \mathbb{R} we have E[a X + bY] = a E[X] + b E[Y]

Proof. In a first step assume that a,b,X,Y are all nonnegative. If X and Y are simple this is true by the remarks after Definition 6.1. For general X and Y pick simple functions X_n and Y_n which increase to X and Y respectively (e.g. X_n=d_n\circ X and Y_n=d_n\circ Y). Then E[a X_n+bY_n]=a E[X_n]+ bE[Y_n]. By the Monotone Convergence Theorem aX_n + bY_n increases to aX+bY and thus taking n \to \infty concludes the proof in this case. For the general case write X=X_+-X_- and Y=Y_+-Y_- and apply the nonnegative case.

6.4 Negligible sets

Definition 6.3  

  • A measurable set A \in \mathcal{A} is negligible with respect to P (or a null set for P) if P(A)=0.
  • A set A (not necessarily measurable) is negligible with respect to P if there exists B \in \mathcal{A} such that
    A \subset B and P(B)=0 (i.e. A is a subset of a set of measure 0).

It is a fine point of measure theory that negligible sets need not be measurable. This is the case, for example, for the Borel \sigma-algebra. There is a standard procedure (called the completion of a probability space) which extends the \sigma-algebra and the probability measure P such that all negligible sets are measurable. The idea is to define, with \mathcal{N} denoting all the null sets of \mathcal{A}, a new \sigma-algebra \overline{\mathcal{A}} =\{ A\cup N\,:\, A \in \mathcal{A}, N\in \mathcal{N} \}
and a new probability measure \overline{P}(A \cup N) = P(A)\,. The probability space (\Omega, \overline{\mathcal{A}}, \overline{P}) is called the completion of (\Omega, \mathcal{A}, P).

For example the completion of the Borel \sigma-algebra on [0,1] with the Lebesgue measure is called the Lebesgue \sigma-algebra. This does not play much of a role, but at a few occasions it will be convenient to assume that the space is complete.

Almost sure properties:

  • We say that two RVs X and Y are equal almost surely if P(X=Y)= P(\{\omega\,:\, X(\omega)=Y(\omega)\})=1, that is X and Y differ only on a negligible set. We write X=Y a.s.

  • If X=Y almost surely then E[X]=E[Y]. Indeed the simple approximations then satisfy X_n=Y_n almost surely, and two simple random variables which are equal almost surely have the same expectation (use the canonical form).

  • We say, for example, that X \ge Y a.s. if P(\{\omega\,:\, X(\omega)\ge Y(\omega)\})=1.

  • We say X_n converges to X almost surely if there exists a negligible set N such that for all \omega \in \Omega \setminus N we have \lim_{n}X_n(\omega)=X(\omega).

Theorem 6.4 Suppose X\ge 0. Then E[X]=0 if and only if X=0 a.s.

Proof. If X=0 a.s. then E[X]=0 because E[0]=0. Conversely suppose E[X]=0 and let A_n=\left\{\omega : X(\omega)\ge \frac{1}{n} \right\}. Then X \ge X 1_{A_n} \ge \frac{1}{n} 1_{A_n} and thus by monotonicity
0=E[X]\ge E[X 1_{A_n}]\ge \frac{1}{n}P(A_n) and thus P(A_n)=0 for all n. But A_n \nearrow \{X > 0\} and thus by sequential continuity P(X > 0)=0, that is P(X = 0)=1. \quad \square

6.5 Fatou’s Lemma

Our first convergence theorem was the monotone convergence theorem Theorem 6.2. Our second convergence theorem still deals with non-negative random variables and is called Fatou’s lemma.

Theorem 6.5 (Fatou’s Lemma) Suppose X_n are non-negative random variables. Then E[\liminf_{n} X_n] \le \liminf_{n} E[X_n]

Proof. Set Y_n= \inf_{m\ge n} X_m. Then Y_n \le Y_{n+1} and \liminf_{n} X_n = \lim_{n} \inf_{m \ge n} X_m = \lim_{n} Y_n. We can use the monotone convergence theorem for the sequence Y_n to get E[\liminf_{n} X_n] = \lim_{n} E[Y_n]\,. \tag{6.5} Also for m \ge n we have Y_n =\inf_{k\ge n} X_k \le X_m and so by monotonicity E[Y_n] \le E[X_m] and thus E[Y_n] \le \inf_{m \ge n}E[X_m]\,. \tag{6.6} Combining Equation 6.5 and Equation 6.6 we find E[\liminf_{n} X_n] \le \lim_{n} \inf_{m \ge n}E[X_m] = \liminf_{ n} E[X_n] \tag{6.7} \quad \square

Variation on Fatou’s Lemma: One can deduce directly from Fatou’s Lemma the following results

  1. If X_n \ge Y and Y is an integrable RV then E[\liminf_{n} X_n] \le \liminf_{n} E[X_n].
    Proof: Apply Fatou’s Lemma to the RV Y_n=X_n-Y which is nonnegative.

  2. If X_n \le Y and Y is an integrable RV E[\limsup_{n} X_n] \ge \limsup_{n} E[X_n]. Proof: Apply Fatou’s Lemma to the RV Y_n=Y-X_n which is nonnegative.

  3. We shall use these versions of Fatou’s Lemma to prove our next big result, the Dominated Convergence Theorem.

  4. The Fatou’s Lemma tells us that you can “lose probability” but never gain it. For example if X_n \to 0 then \liminf_n E[X_n] \ge 0.
    For example take \Omega=[0,1] and P the Lebesgue measure, and let X_n(\omega)= n 1_{[0,\frac{1}{n}]}(\omega). Then E[X_n]=nP([0,\frac{1}{n}])=1 for all n. But X_n \to 0 a.s. and so E[\lim_n X_n] =0 \not =1.
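The "lost mass" in this example can be checked exactly; a small sketch (the helper names E and X are illustrative):

```python
from fractions import Fraction

def E(n):
    # X_n = n * 1_{[0,1/n]} under the Lebesgue measure on [0,1]:
    # E[X_n] = n * P([0, 1/n]) = n * (1/n) = 1 for every n
    return n * Fraction(1, n)

def X(n, omega):
    # pointwise value X_n(ω)
    return n if omega <= Fraction(1, n) else 0

print([E(n) for n in (1, 10, 100)])                   # expectations stay at 1
print([X(n, Fraction(1, 2)) for n in (1, 10, 100)])   # at ω = 1/2 the values are eventually 0
```

So the expectations stay at 1 while the pointwise limit is 0: the inequality in Fatou's Lemma can be strict.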

6.6 Dominated convergence Theorem

Theorem 6.6 (Dominated convergence theorem) Suppose \{X_n\} is a sequence of random variables such that

  1. \lim_{n}X_n(\omega) =X(\omega) for all \omega

  2. There exists an integrable random variable Y such that |X_n|\le Y for all n. Then \lim_{n}E[X_n] = E[X] =E[\lim_{n}X_n]

Proof. We derive it from Fatou’s Lemma. The condition |X_n|\le Y means that -Y \le X_n \le Y.
Applying Fatou’s lemma to Y-X_n \ge 0 we find that E[ \liminf_{n} (Y-X_n)] \le \liminf_n E[Y - X_n]. Using that \liminf_n (-a_n) = - \limsup_n a_n and \lim_n{X_n}=X we find E[ \liminf_{n} (Y-X_n)] = E[Y] + E[ \liminf_{n} (-X_n)] = E[Y] - E[ \limsup_{n} X_n] =E[Y]-E[X] and \liminf_n E[Y - X_n] = E[Y] - \limsup_{n} E[X_n] and thus we have \limsup_{n} E[X_n] \le E[X]. Applying Fatou’s Lemma to X_n+Y\ge 0 yields in a similar manner E[X] \le \liminf_{n} E[X_n] (check this). Therefore we have \limsup_{n} E[X_n] \le E[X] \le \liminf_{n}E[X_n]. This proves that \lim_{n} E[X_n]=E[X]. \quad \square

A special case of the dominated convergence theorem is frequently used

Theorem 6.7 (Bounded convergence theorem) Suppose \{X_n\} is a sequence of random variables such that

  1. \lim_{n}X_n(\omega) =X(\omega) for all \omega

  2. There exists a constant c \ge 0 such that |X_n|\le c for all n. Then \lim_{n}E[X_n] = E[X] =E[\lim_{n}X_n]

Proof. The constant random variable Y=c is integrable so the result follows from the dominated convergence theorem. \quad\square

Remark on almost sure versions:

The monotone convergence theorem, Fatou’s lemma, and the dominated convergence theorem also have almost sure versions.
For example if Y is integrable and |X_n| \le Y almost surely and X_n(\omega)\to X almost surely then \lim_{n} E[X_n]=E[X]. To see this define N=\{\omega \,:\, |X_n(\omega)|\le Y(\omega) \text{ for all } n \} \quad \text{ and } \quad M= \{\omega\,:\,\lim_{n} X_n(\omega)= X(\omega) \} Then P(M^c)=P(N^c)=0. We can modify the RVs on sets of measure 0 in such a way that the statements hold for all \omega: set X_n=0, X=0, Y=0 on M^c \cup N^c. Then the properties hold for all \omega and since the expectations do not change we are done.

6.7 The Expectation rule (very useful for computations)

Computing the expectation of a RV X (or of h(X)) can be done using either P (good for proofs) or P^X (good for computations). As we will see this is an abstract version of the change of variable formula from Calculus!

Notation Another widely used (and convenient) notation for the expectation is E[X] = \int_\Omega X(\omega) dP(\omega)

Theorem 6.8 (Expectation rule) Suppose X is a RV on (\Omega, \mathcal{A}, P) taking value in (F, \mathcal{F}) and with distribution P^X. Let h:(F, \mathcal{F}) \to (\mathbb{R},\mathcal{B}) be measurable.

  1. h(X) \in \mathcal{L}^1(\Omega, \mathcal{A}, P) if and only if h \in \mathcal{L}^1(F, \mathcal{F},P^X).

  2. If either h \ge 0 or h satisfies the equivalent conditions in 1. we have E[h(X)]= \int_\Omega h(X(\omega)) dP(\omega) = \int_{F} h(x) d P^X(x) \tag{6.8}

  3. Conversely suppose Q is a probability measure on (F,\mathcal{F}) such that E[h(X)]= \int_{F} h(x) dQ(x) for all non-negative measurable h. Then Q=P^X.

Proof. The probability distribution of X, P^X, is defined by P^X(B) = P(X^{-1}(B)). Therefore E[1_B(X)]= P(X \in B) = P^X(B) = \int_F 1_B(x) dP^X(x) This proves Equation 6.8 in the case of a characteristic function. By linearity, if h is a simple function then Equation 6.8 holds.

If h:F \to \mathbb{R} is non-negative then pick a sequence of simple functions h_n such that h_n \nearrow h. Then \begin{aligned} E[h(X)] = E[\lim_{n \to \infty} h_n(X)] & = \lim_{n \to \infty} E[h_n(X)] \quad \text{ by the MCT in } \Omega \\ & = \lim_{n \to \infty} \int_F h_n(x) dP^X(x) \quad \text{ because } h_n \text{ is simple}. \\ & = \int_F \lim_{n \to \infty} h_n(x) dP^X(x) \quad \text{ by MCT in } F \\ & = \int_F h(x) dP^X(x) \end{aligned} This proves Equation 6.8 for h non-negative. If we apply this to |h| this proves part 1. of the Theorem. For general h, write h=h_+-h_- and deduce the result by subtraction.

For the converse in item 3. just take h=1_A to be a characteristic function. Then P(X \in A)=E[1_A(X)]= \int_F 1_A(x) dQ(x)=Q(A)\,. Since A is arbitrary, the distribution of X is Q.

Consequences:

  • If X is a real-valued random variable, we can compute its expectation by doing an integral over \mathbb{R}.

  • If X is real-valued and h: \mathbb{R} \to \mathbb{R} is a (measurable) function (e.g. h(x)=x^{n}, or e^{i\alpha x}, \cdots), then we
    have E[X]= \int_{\mathbb{R}} x dP^X(x)\, \quad E[X^n]= \int_{\mathbb{R}} x^n dP^X(x)\, \quad \cdots An alternative would be to compute the distribution P^Y of Y=X^n and then we have E[X^n] = E[Y] = \int_{\mathbb{R}} y dP^Y(y)

  • Generally we will compute E[h(X)] using the distribution of X ….

  • But often we will work backward. We will use the change of variable formula to compute the distribution of Y (see item 3. in Theorem 6.8). Checking the equality for all non-negative functions or all characteristic functions is not always easy, so we will show that one can restrict oneself to just nice functions! (Later..)

  • Examples forthcoming
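As a first toy illustration of the expectation rule on a finite space, here is a sketch with a fair die (all names and the choices of X and h are illustrative):

```python
from fractions import Fraction
from collections import Counter

Omega = range(1, 7)                      # a fair die: P({ω}) = 1/6
P = {w: Fraction(1, 6) for w in Omega}
X = lambda w: w % 3                      # a random variable on Ω with values {0, 1, 2}
h = lambda x: x ** 2

# E[h(X)] computed on Ω with the measure P ...
lhs = sum(h(X(w)) * P[w] for w in Omega)

# ... and on the range of X with the distribution P^X(B) = P(X ∈ B)
PX = Counter()
for w in Omega:
    PX[X(w)] += P[w]
rhs = sum(h(x) * p for x, p in PX.items())

print(lhs, rhs)   # both equal 5/3
```

Both computations give E[h(X)]=5/3, as Equation 6.8 asserts.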

6.8 Applications of the expectations rule.

We investigate how the pdf of a random variable transforms under a linear transformation X \mapsto Y= aX+b.

Theorem 6.9 Suppose the real-valued RV X has the pdf f_X(t). Then for a \ne 0 the RV Y=aX+b has the pdf f_Y(y)=\frac{1}{|a|}f_X\left(\frac{y-b}{a}\right)

Proof. For a change we prove it directly using the expectation rule. The pdf f_Y(y) must satisfy, for any nonnegative h, E[h(Y)] = \int h(y) f_Y(y) dy. We rewrite this using the pdf of X, using the expectation rule again and the change of variable y=ax+b: E[h(Y)] = E[h(aX+b)] = \int_{-\infty}^\infty h(ax+b) f_X(x) dx = \int_{-\infty}^\infty h(y)f_X\left(\frac{y-b}{a}\right) \frac{1}{|a|} dy and therefore we must have f_Y(y)= \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right).

Remark Alternatively you can prove this using the CDF, for example for a>0 F_Y(t)=P (Y \le t) = P(aX+b \le t) = P\left(X\le \frac{t-b}{a}\right)= F_X\left(\frac{t-b}{a}\right) and then differentiate. Do the case a<0.
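A quick numerical sanity check of Theorem 6.9, using that for X standard normal the RV Y=aX+b is again normal with mean b and standard deviation |a| (the function name f_Y and the test points are illustrative):

```python
from statistics import NormalDist

def f_Y(y, a, b, f_X):
    # density of Y = aX + b from Theorem 6.9 (requires a ≠ 0)
    return f_X((y - b) / a) / abs(a)

a, b = -2.0, 1.0
f_X = NormalDist(0, 1).pdf
exact = NormalDist(b, abs(a)).pdf   # Y = aX + b is normal with mean b, std |a|

for y in (-3.0, 0.0, 2.5):
    assert abs(f_Y(y, a, b, f_X) - exact(y)) < 1e-12
```

Note the check also exercises the a<0 case, which is why the theorem has |a| rather than a.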

Location-scale family of random variables: A family of random variables parametrized by \alpha \in \mathbb{R} (=location) and \beta \in (0,\infty) (=scale) is called a location-scale family if X belonging to the family implies that Y=aX+b also belongs to the family for any a > 0 and b \in \mathbb{R}. If the family has densities this is equivalent to requiring that the densities have the form f_{\alpha,\beta}(x) = \frac{1}{\beta}f\left(\frac{x-\alpha}{\beta}\right) for some fixed function f(x).

  • Normal RVs form a location-scale family with parameters \mu (=location) and \sigma (=scale) and f(x)= \frac{e^{-x^2/2}}{\sqrt{2 \pi}}

  • The Cauchy distribution with pdf \frac{1}{\beta \pi} \frac{1}{1 + (\frac{x-\alpha}{\beta})^2} is also a location scale family.

  • Some families of distributions form only a scale family. For example the exponential random variables with density f(x) = \frac{1}{\beta} e^{-x/\beta} are a scale family with scaling parameter \beta.

6.9 Exercises

Exercise 6.1 Show the following facts:

  1. Show that X=0 almost surely if and only if E[X 1_A]=0 for all measurable sets A.

  2. Suppose X is a random variable with E[X] < \infty. Show that X < \infty almost surely.
    Hint: Consider the set B_n=\{X \ge n\}.

Exercise 6.2 (sum of random variables) Suppose X_n is a collection of random variables defined on the probability space (\Omega,\mathcal{A},P).

  1. Prove that if the X_n are all nonnegative then E[\sum_{k=1}^\infty X_k] = \sum_{k=1}^\infty E[X_k].
    Hint: Use the monotone convergence theorem.

  2. Prove that if \sum_{k=1}^\infty E[|X_k|] is finite then E[\sum_{k=1}^\infty X_k] = \sum_{k=1}^\infty E[X_k].
    Hint: Consider the RV Y=\sum |X_k| and use the dominated convergence theorem and Exercise 6.1, part 2.

Exercise 6.3 (Building new probability measures using densities) Suppose Y is a random variable on the probability space (\Omega, \mathcal{A},P) with Y \ge 0 almost surely and E[Y]=1.

  1. Define Q: \mathcal{A} \to \mathbb{R} by Q(A) = E[Y 1_A]. Show that Q is a probability measure on (\Omega, \mathcal{A}). We denote by E_Q the expectation with respect to Q.

  2. Show, using the definition of the integral, that E_Q[X] = E[XY].

  3. Show if B\in \mathcal{A} is such that P(B)=0 then we have Q(B)=0. (We say then that Q is absolutely continuous with respect to P.)

  4. Show that, in general Q(B)=0 does not imply P(B)=0 but that if Y > 0 almost surely then Q(B)=0 does imply P(B)=0.

  5. Assuming Y>0 almost surely show that \frac{1}{Y} is integrable with respect to Q and show that the measure R defined by R(A) = E_Q[\frac{1}{Y}1_A] is equal to P.

Exercise 6.4 (the log normal distribution)  

  1. Suppose X is a normal random variable with parameters \mu and \sigma^2. Show that the random variable Y=e^{X} has the distribution with the following density f(x) = \left\{ \begin{array}{cl} \frac{1}{x} \frac{1}{\sqrt{2 \pi} \sigma} e^{ - (\log(x)-\mu)^2/2\sigma^2} & x > 0 \\ 0 & x \le 0 \end{array} \right. The random variable Y is said to have the log-normal distribution with parameters \mu and \sigma^2.

  2. Show that E[Y^r]= e^{r \mu + \frac{1}{2}\sigma^2 r^2}. Hint: Do the change of variables y=\log(x)-\mu in the integral for E[Y^r].
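A Monte Carlo sanity check of the moment formula in part 2. (this is a numerical verification sketch, not the requested change-of-variables proof; the parameters and sample size are illustrative):

```python
import math, random
from statistics import NormalDist, fmean

mu, sigma, r = 0.0, 0.5, 1.0
rng = random.Random(1)

# sample Y = e^X with X normal(mu, sigma) via the quantile method
ys = [math.exp(mu + sigma * NormalDist().inv_cdf(rng.random())) for _ in range(200_000)]

mc = fmean(y ** r for y in ys)                          # Monte Carlo estimate of E[Y^r]
exact = math.exp(r * mu + 0.5 * sigma ** 2 * r ** 2)    # claimed closed form
print(mc, exact)
```

With these parameters exact = e^{1/8}, and the empirical average agrees within Monte Carlo error.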

Exercise 6.5 (Cauchy distribution)  

  1. Suppose X is a random variable with density f_X(x). Express the density f_Y of Y=\frac{a}{X} in terms of f_X.

  2. A Cauchy RV with parameters (\alpha, \beta) has the pdf f(x)=\frac{1}{\beta \pi}\frac{1}{1+ (x-\alpha)^2/\beta^2}.

    • Show that if X is a Cauchy RV so is Y=aX+b and find how the parameters transform.

    • Show that if X has a Cauchy distribution with \alpha=0 then \frac{1}{X} has again a Cauchy distribution.

    • Show that the mean and the variance of a Cauchy RV are undefined.

Exercise 6.6 Consider the RV X with CDF given by F(t)= \left\{ \begin{array}{cl} 0 & t \le -1 \\ 1/4 + \frac{1}{3} (t+1)^2 & -1 \le t < 0 \\ 1 - \frac{1}{4}e^{-2t} & t \ge 0 \end{array} \right. Compute E[X] and {\rm Var}(X).

7 Inequalities

7.1 Jensen inequality

Some facts about convex functions: Recall that a function \phi on \mathbb{R}^d is convex if \phi( \alpha x + (1-\alpha)y) \le \alpha \phi(x) + (1-\alpha) \phi(y) for all x,y and all 0 \le \alpha \le 1. This means that the line segment between (x,\phi(x)) and (y,\phi(y)) lies above the graph of \phi(z) for z lying on the line segment between x and y.

  • An equivalent description of a convex function (“obvious” from a picture, with a proof in the homework) is that at any point x_0 there is an affine function l(x)=\phi(x_0) + c\cdot(x-x_0) such that \phi(x) \ge l(x) for all x (the graph of \phi lies above l) and \phi(x_0)=l(x_0). If \phi is differentiable at x_0 this supporting plane is the tangent plane to the graph at x_0, and we have \phi(x) \ge \phi(x_0) + \nabla \phi(x_0)\cdot(x-x_0)

  • If \phi is twice continuously differentiable then \phi is convex if and only if the matrix of second derivatives D^2\phi(x_0) is positive semidefinite for all x_0.

Theorem 7.1 (Jensen inequality) If \phi: \mathbb{R} \to \mathbb{R} is a convex function then E[\phi(X)]\ge \phi(E[X]) provided both expectations exist, i.e. E[|X|]< \infty and E[|\phi(X)|] < \infty.

Proof. Choose x_0=E[X] and pick a supporting line at x_0: for some c and all x, \phi(x) \ge \phi(E[X]) + c\,(x - E[X]). By the monotonicity and linearity of expectation we obtain E[\phi(X)] \ge \phi(E[X]) + c\, E[X - E[X]] = \phi(E[X])\,.

Examples

  • Since f(x)=x^2 is convex we have E[X]^2 \le E[X^2].

  • Since f(x)=e^{\alpha x} is convex for any \alpha \in \mathbb{R} we have E[e^{\alpha X}] \ge e^{\alpha E[X]}.
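The first example can be checked exactly on a small finite distribution (the distribution is illustrative):

```python
from fractions import Fraction

# X uniform on {1, 2, 3, 4}: Jensen for φ(x) = x² gives E[X]² ≤ E[X²]
xs = [1, 2, 3, 4]
EX = sum(Fraction(x, 4) for x in xs)         # E[X]  = 5/2
EX2 = sum(Fraction(x * x, 4) for x in xs)    # E[X²] = 15/2
assert EX ** 2 <= EX2                         # 25/4 ≤ 30/4
print(EX ** 2, EX2)
```

The gap EX2 - EX² = 5/4 is exactly the variance of X.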

Remark The theory of convex functions is very rich and immensely useful!

We will need the following slight generalization of Jensen inequality

Theorem 7.2 If \phi: \mathbb{R}^d \to \mathbb{R} is a convex function and X=(X_1, \cdots, X_d) is a RV taking values in \mathbb{R}^d, then we have E[\phi(X)]\ge \phi(E[X_1], \cdots, E[X_d]) provided both expectations exist.

Proof. Same proof as Jensen.

7.2 L^p-norms

Suppose (\Omega,\mathcal{A},P) is a probability space and X is a real-valued random variable.

Definition 7.1 (L^p-norms) Given a random variable X and 1 \le p \le \infty we define \|X\|_p = E[|X|^p]^\frac{1}{p} \quad \text{ for } 1 \le p < \infty and \|X\|_\infty= \inf\{ b \in \mathbb{R}_+\,:\, |X|\le b \textrm{ a.s} \} and \|X\|_p is called the L^p norm of a RV X.

Remarks

  • It is easy to check that \|X\|_p=0 \implies X=0 \text{ almost surely}, and that \|cX\|_p = |c| \|X\|_p.

  • \|X\|_p <\infty means that |X|^p is integrable (if 1 \le p < \infty) and that X is almost surely bounded (if p=\infty). Often \|X\|_\infty is called the essential supremum of X.
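A small numerical illustration of Definition 7.1, and of the monotonicity of p \mapsto \|X\|_p asserted in Theorem 7.4 below, on a finite distribution (the distribution is illustrative):

```python
# a simple discrete RV: (value, probability) pairs with probabilities summing to 1
dist = [(1.0, 0.5), (2.0, 0.3), (4.0, 0.2)]

def lp_norm(p):
    # ||X||_p = E[|X|^p]^(1/p)
    return sum(abs(x) ** p * w for x, w in dist) ** (1 / p)

norms = [lp_norm(p) for p in (1, 2, 3, 4, 8)]
print(norms)   # nondecreasing in p, bounded by ||X||_∞ = 4
```

The computed norms increase with p toward the essential supremum \|X\|_\infty=4.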

7.3 Cauchy-Schwartz, Hölder, Minkowski

Theorem 7.3 (Hölder and Minkowski inequalities)  

  1. H"older: Suppose 1\le p,q\le \infty are such that \frac{1}{p}+\frac{1}{q}=1 then we have \|X Y\|_1 \le \|X\|_p \|Y\|_q\,. Special case is the Cauchy-Schwartz inequality p=q=2 \|X Y\|_1 \le \|X\|_2 \|Y\|_2 \,.
  2. Minkowski: For 1\le p \le \infty we have \|X + Y\|_p \le \|X\|_p + \|Y\|_p \,. (a.k.a triangle inequality)

Proof. The proof is ultimately a consequence of Jensen’s inequality (there are many different proofs but all rely in one way or another on convexity).

To start one shows that the functions f(u,v)=u^{b} v^{1-b} \quad \text{ and } \quad g(u,v)=(u^{b} + v^{b})^\frac{1}{b} \tag{7.1} are concave on \mathbb{R}^2_+ for b \in (0,1] (that is, -f and -g are convex). To see this compute for example D^2f and D^2g.

Once this is done let us turn to Hölder inequality:

  • If p=1 and q=\infty then we have |XY|\le|X| \|Y\|_\infty almost surely and thus \|XY\|_1\le \|X\|_1\|Y\|_\infty.

  • The concavity of f in Equation 7.1, together with Jensen’s inequality (Theorem 7.2 applied to -f), implies that for non-negative random variables U and V we have E[ U^b V^{1-b}] \le E[U]^b E[V]^{1-b}\,. If 1 < p,q <\infty then we set b=\frac{1}{p} so that 1-b=\frac{1}{q}, and take U=|X|^p and V=|Y|^q. Then E[ |X| |Y| ] = E\left[ (|X|^p)^\frac{1}{p} (|Y|^q)^\frac{1}{q}\right] \le E[|X|^p]^\frac1p E[|Y|^q]^\frac{1}{q}

For Minkowski

  • If p=\infty Minkowski inequality is easy to check.

  • The concavity of g in Equation 7.1 implies that for non-negative random variables U and V we have E\left[ (U^b + V^{b})^{1/b}\right] \le \left(E[U]^b + E[V]^{b}\right)^{1/b}\,. This implies Minkowski if we take b=\frac{1}{p}, U=|X|^p and V=|Y|^p, together with the pointwise bound |X+Y|\le |X|+|Y|.

\quad \square.

Definition 7.2 (L^p spaces) For 1\le p \le \infty we define \mathcal{L}^p(\Omega,\mathcal{A},P) =\left\{ X: \Omega \to \mathbb{R}, \|X\|_p < \infty \right\} and the quotient space L^p(\Omega,\mathcal{A},P) = \mathcal{L}^p(\Omega,\mathcal{A},P)/\sim where X \sim Y (meaning X=Y a.s.) is an equivalence relation.
The space L^p(\Omega,\mathcal{A},P) is a normed vector space.


Theorem 7.4 The map p \mapsto \|X\|_p is an increasing map from [1,\infty) to [0,\infty].

  • If \|X\|_\infty < \infty then p \mapsto \|X\|_p is continuous on [1,\infty].

  • If \|X\|_\infty = \infty there exists q such that \|X\|_p is continuous on [1,q] (that is left-continuous at q) and \|X\|_p=+\infty on (q,\infty).

Proof. Homework

Examples:

  • If X has a Pareto distribution with parameters \alpha and x_0 then its CDF is F(t)= 1- \left(\frac{x_0}{t}\right)^\alpha \quad \text{ for } t \ge x_0 and F(t)=0 for t \le x_0.
    The pdf is f(x) = \frac{\alpha x_0^\alpha}{x^{\alpha+1}} and we have E[|X|^p] = E[X^p] = \int_{x_0}^\infty \alpha x_0^\alpha x^{p-\alpha -1} dx = \left\{ \begin{array}{cl} \frac{\alpha}{\alpha-p} x_0^{p}& p <\alpha \\ +\infty & p \ge \alpha \end{array} \right.

  • If X has a normal distribution (or an exponential, gamma, etc…) then X \in L^p for all 1\le p <\infty but X \notin L^\infty.

  • Other norms exists (Orlicz norms) to capture the tail of random variables.

7.4 Markov, Chebyshev, and Chernov

Another very important inequality is the so-called Markov inequality. Very simple and very useful.

Theorem 7.5 (Markov inequality) If X \ge 0 then for any a>0 P(X \ge a) \le \frac{E[X]}{a}

Proof. Using that X is non-negative we have X \ge X 1_{X\ge a} \ge a 1_{X\ge a} and taking expectation and monotonicity gives the result.

Theorem 7.6 (Chebyshev inequality) We have
P(|X-E[X]| \ge \varepsilon) \le \frac{{\rm Var}[X]}{\varepsilon^2} where {\rm Var}(X)=E[(X - E[X])^2] is the variance of X.

Proof. Apply Markov inequality to the random variable (X- E[X])^2 whose expectation is {\rm Var}[X]: P(|X-E[X]| \ge \varepsilon) = P((X-E[X])^2 \ge \varepsilon^2) \le \frac{E[(X-E[X])^2]}{\varepsilon^2} =\frac{{\rm Var}[X]}{\varepsilon^2}

  • Chebyshev inequality suggests measuring deviations from the mean in multiples of the standard deviation \sigma= \sqrt{{\rm Var}[X]}: P(|X-E[X]| \ge k \sigma ) \le \frac{1}{k^2}

  • Chebyshev inequality might be extremely pessimistic

  • Chebyshev is sharp. Consider the RV X with distribution P(X=\pm 1) = \frac{1}{2k^2}, \quad P(X=0)=1 -\frac{1}{k^2}. Then E[X]=0 and {\rm Var}[X]=\frac{1}{k^2}, so that P( |X|\ge k \sigma) = P(|X| \ge 1 )=\frac{1}{k^2}
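The sharp example can be verified exactly (a sketch; the helper name sharp_example is illustrative):

```python
from fractions import Fraction

def sharp_example(k):
    # X = ±1 with probability 1/(2k²) each, X = 0 otherwise
    p = Fraction(1, 2 * k ** 2)
    dist = [(-1, p), (0, 1 - 2 * p), (1, p)]
    mean = sum(x * w for x, w in dist)               # = 0
    var = sum((x - mean) ** 2 * w for x, w in dist)  # = 1/k²
    tail = sum(w for x, w in dist if abs(x) >= 1)    # P(|X| ≥ kσ), since σ = 1/k
    return var, tail

for k in (2, 3, 10):
    var, tail = sharp_example(k)
    assert var == Fraction(1, k ** 2)
    assert tail == Fraction(1, k ** 2)   # Chebyshev bound 1/k² attained exactly
```

So for this distribution P(|X-E[X]|\ge k\sigma) equals the Chebyshev bound 1/k^2 exactly.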

Theorem 7.7 (Chernov inequality) We have for any a P(X \ge a) \le \inf_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}} \quad \quad P(X \le a) \le \inf_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}

Proof. This is again an application of Markov inequality. If t \ge 0 since the function e^{tx} is increasing P( X \ge a) = P( e^{tX} \ge e^{ta} ) \le \frac{E[e^{tX}]}{e^{ta}} Since this holds for any t\ge 0 we can then optimize over t. The second inequality is proved in the same manner. \quad \square

  • Chernov inequality is a very sharp inequality as we will explore later on when studying the law of large numbers. The optimization over t is the key ingredient which ensures sharpness.

  • The function M(t)=E[e^{tX}] is called the moment generating function of the RV X and we will meet it again.

  • Example Suppose X is a centered normal random variable with \mu=0 and variance \sigma^2. Then, completing the square, we have E[e^{tX}]= \frac{1}{\sqrt{2\pi\sigma^2}} \int e^{tx} e^{-x^2/2\sigma^2} dx = \frac{1}{\sqrt{2\pi\sigma^2}} \int e^{-(x-\sigma^2t)^2/2\sigma^2} e^{\sigma^2t^2/2} dx = e^{\sigma^2t^2/2} and the Chernov bound gives for a \ge 0 that P(X \ge a) \le \inf_{t\ge 0} e^{\sigma^2t^2/2 -ta} = e^{- \sup_{t\ge 0} (ta - \sigma^2t^2/2)} = e^{-a^2/2\sigma^2} which turns out to be sharp up to a prefactor (see exercises).
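A numeric check that this Gaussian Chernov bound dominates the exact tail, computed via the complementary error function (the function names are illustrative):

```python
import math

def chernov_bound(a, sigma):
    # optimizing e^{σ²t²/2 - ta} over t ≥ 0 gives t* = a/σ² and the value e^{-a²/2σ²}
    return math.exp(-a ** 2 / (2 * sigma ** 2))

def normal_tail(a, sigma):
    # exact P(X ≥ a) for X ~ N(0, σ²)
    return 0.5 * math.erfc(a / (sigma * math.sqrt(2)))

for a in (0.5, 1.0, 2.0, 3.0):
    assert normal_tail(a, 1.0) <= chernov_bound(a, 1.0)
print(normal_tail(2.0, 1.0), chernov_bound(2.0, 1.0))
```

At a=2 the exact tail is about 0.0228 against a bound of e^{-2}\approx 0.135: correct exponential rate, but off by a prefactor.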

7.5 Exercises

Exercise 7.1  

  • Prove the one-sided Chebyshev inequality: if X is a random variable and \epsilon >0 then P( X - E[X] \ge \epsilon ) \le \frac{\sigma^2}{\sigma^2 + \epsilon^2} where \sigma^2={\rm Var}(X).
    Hint: Set Y=X-E[X] and use Markov inequality for P(Y \ge \epsilon) = P( (Y+\alpha)^2 \ge (\epsilon+ \alpha)^2) and optimize over \alpha

  • Prove that

    • The one-sided Chebyshev inequality is sharper than the Chebyshev inequality for one sided bounds P(X - E[X] \ge \epsilon).

    • The Chebyshev inequality is sharper than the one-sided Chebyshev inequality for the two-sided bound P(|X - E[X]| \ge \epsilon)

Exercise 7.2 Prove Theorem 7.4. For the monotonicity use Hölder or Jensen. For the continuity let p_n \nearrow q and use the dominated convergence theorem.

Exercise 7.3 The Chernov bound has the form P( X \ge a) \le e^{- \sup_{t \ge 0}\{ ta - \ln E[e^{tX}]\}}.
Show that this bound is useful only if a > E[X]. To do this use Jensen inequality to show that if a \le E[X] the Chernov bound is trivial.

Exercise 7.4 Consider an exponential RV X with parameter \lambda and density \lambda e^{-\lambda t} for t \ge 0.

  • Compute the moment generating function M(t)=E[e^{tX}] as well as all moments E[X^n]

  • To do a Chernov bound compute \sup_{t \ge 0}\{ ta - \ln E[e^{tX}] \} (see also Exercise 7.3).

  • For a > \frac{1}{\lambda} estimate P( X >a) (which of course is equal to e^{-\lambda a}!) using

    1. Markov inequality.

    2. Chebyshev

    3. One-sided Chebyshev

    4. Chernov

Exercise 7.5 (mean and median of a RV) A median m for a RV X is a value such that P(X\le m) \ge \frac{1}{2} and P(X \ge m) \ge \frac{1}{2} (cf. the quantile function). For example if the CDF is one-to-one then the median m=F^{-1}(\frac{1}{2}) is unique.

The median m and the mean \mu= E[X] are two measures (usually distinct) of the “central value” of the RV X.

  • Consider the minimum square deviation \min_{a \in \mathbb{R}} E[ (X-a)^2]. Show that the minimum is attained when a=E[X].

  • Consider the minimum absolute deviation \min_{a \in \mathbb{R}} E[ |X-a|]. Show that the minimum is attained when a is a median m.
    Hint: Suppose a >m then we have |z-a| - |z-m| = \left\{ \begin{array}{ll} m - a & \textrm{ if } z \ge a \\ a +m -2z \quad (\ge m -a) & \textrm{ if } m < z < a \\ a-m & \textrm{ if } z \le m \end{array} \right.

Exercise 7.6 (mean and median of a RV, continued) Our goal is to prove a bound on how far the mean and the median can be apart from each other, namely that |\mu - m| \le \sigma where \sigma is the standard deviation of X. I am asking for two proofs:

  • First proof: Use the characterization of the median in Exercise 7.5 and Jensen inequality (twice) starting from |\mu-m|.

  • Second proof: Use the one-sided Chebyshev for X and -X with \epsilon = \sigma.

The quantity S=\frac{\mu-m}{\sigma} \in [-1,1] is called the non-parametric skew of X and measures the asymmetry of the distribution of X.

8 Independence and product measures

8.1 Independence

  • Two events A,B \in \mathcal{A} are independent if P(A \cap B) =P(A) P(B) (equivalently P(A|B)=P(A) when P(B)>0), and this generalizes to arbitrary collections of events (see Definition 1.8)

  • For two random variables X and Y to be independent, any information or knowledge derived from the RV Y should not influence the RV X. All the information encoded in a RV X taking values in (E,\mathcal{E}) is contained in the \sigma-algebra generated by X, that is \sigma(X) = X^{-1}(\mathcal{E}) = \left\{ X^{-1}(B), B \in \mathcal{E}\right\}. This motivates the following definition.

Definition 8.1 (Independence) Let (\Omega, \mathcal{A},P) be a probability space.

  1. Independence of \sigma-algebras: Two sub-\sigma-algebras \mathcal{A}_1 \subset \mathcal{A} and \mathcal{A}_2 \subset \mathcal{A} are independent if P(A_1 \cap A_2) = P(A_1) P(A_2) \quad \text{ for all } A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2\,. A collection (not necessarily countable) of \sigma-algebras \left\{\mathcal{A}_j\right\}_{j \in J} is independent if for any finite subset I \subset J we have
    P\left( \bigcap_{i\in I} A_i\right) = \prod_{i \in I} P(A_i) \quad \text{ for all } A_i \in \mathcal{A}_i \,.

  2. Independence of random variables: The collection of random variables X_j: (\Omega, \mathcal{A},P) \to (E_j, \mathcal{E}_j) for j \in J is independent if the collection of \sigma-algebras \sigma(X_j) is independent.

We consider from now on only two random variables X and Y but everything generalizes easily to arbitrary collections. Our next theorem makes the definition easier to check.

Theorem 8.1 (Characterization of independence) Two random variables X (taking values in (E,\mathcal{E})) and Y (taking values in (F,\mathcal{F})) are independent if and only if any of the following equivalent conditions holds.

  1. P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{E} and for all B \in \mathcal{F}.

  2. P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{C} and for all B \in \mathcal{D}, where \mathcal{C} and \mathcal{D} are p-systems generating \mathcal{E} and \mathcal{F}.

  3. f(X) and g(Y) are independent for any measurable f and g.

  4. E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and measurable (or nonnegative) f,g.

  5. If E=F=\mathbb{R} (or more generally E and F are complete separable metric spaces) then E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and continuous functions f,g.

Proof.

  • Item 1. is merely a restatement of the definition and clearly item 1. \implies item 2.

  • To see that item 2. \implies item 1. we use the monotone class theorem (Theorem 3.1). Fix B \in \mathcal{D}; then the collection \left\{ A \in \mathcal{E}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\} contains the p-system \mathcal{C} and is a d-system (it contains \Omega and is closed under complements and under increasing limits, check this please). Therefore by the monotone class theorem it contains \mathcal{E}. Analogously, fix now A \in \mathcal{E}; then the set
    \left\{ B \in \mathcal{F}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\} contains the p-system \mathcal{D} and is a d-system, hence contains \mathcal{F}.

  • To see that item 3. \implies item 1. take f(x)=x and g(y)=y. Conversely we note that V=f(X) is measurable with respect to \sigma(X): since V^{-1}(B) = X^{-1} (f^{-1}(B)) \in \sigma(X), this shows that \sigma(f(X))\subset \sigma(X). Likewise \sigma(g(Y)) \subset \sigma(Y). Since \sigma(X) and \sigma(Y) are independent so are \sigma(f(X)) and \sigma(g(Y)).

  • To see that item 4. \implies item 1. take f=1_A and g=1_B. To show that item 1. \implies item 4. note that item 1. can be rewritten as E[1_A(X) 1_B(Y)] =E[1_A(X)] E[1_B(Y)]. By linearity of the expectation, item 4. then holds for all simple functions f and g. If f and g are nonnegative then we choose sequences of simple functions such that f_n \nearrow f and g_n \nearrow g. We have then f_n g_n \nearrow fg and using the monotone convergence theorem twice we have \begin{aligned} E[f(X)g(Y)] & = E[ \lim_{n} f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) ] E[g_n(Y)] \\ &= \lim_{n} E[ f_n(X) ] \lim_{n} E[g_n(Y)] =E[ f(X) ] E[g(Y)] \end{aligned}

If f and g are bounded and measurable then we write f=f_+-f_- and g=g_+-g_-. Then f_\pm and g_\pm are bounded and measurable and thus the products of f_\pm(X) and g_\pm(Y) are also integrable. We have \begin{aligned} E[f(X)g(Y)] & = E[(f_+(X)-f_-(X))(g_+(Y)-g_-(Y))] \\ &= E[ f_+(X)g_+(Y)]+ E[f_-(X)g_-(Y)] - E[f_+(X)g_-(Y)]- E[f_-(X)g_+(Y)] \\ &= E[ f_+(X)]E[g_+(Y)]+ E[f_-(X)]E[g_-(Y)] - E[f_+(X)]E[g_-(Y)]- E[f_-(X)]E[g_+(Y)] \\ &= E[ f_+(X)-f_-(X)] E[ g_+(Y)-g_-(Y)] \end{aligned}

  • Clearly item 1. \implies item 4. \implies item 5. For the converse we show that item 5. \implies item 2. Given an interval (a,b) consider an increasing sequence of piecewise linear continuous functions f_n \nearrow 1_{(a,b)}: f_n(t) = \left\{ \begin{array}{cl} 0 & t \le a \text{ or } t \ge b \\ 1 & a+\frac{1}{n} \le t \le b-\frac{1}{n}\\ \text{linear} & \text{otherwise } \end{array} \right. Let us consider the p-system which contains all intervals of the form (a,b) and which generates the Borel \sigma-algebra \mathcal{B}. By a monotone convergence argument like the one for item 1. \implies item 4. we see that for f_n \nearrow 1_{(a,b)} and g_n \nearrow 1_{(c,d)}, E[f_n(X) g_n(Y)]= E[f_n(X)] E[g_n(Y)] \implies P(X \in (a,b), Y \in (c,d)) = P(X \in (a,b)) P(Y \in (c,d)) and so item 2. holds.
    For separable metric spaces, the so-called Urysohn lemma can be used to prove the same result. \quad \square

8.2 Independence and product measures

  • While we have so far expressed independence of X and Y as a property on the sample space \Omega, we can also view it as a property of the distributions P^X and P^Y.

  • Example: Suppose X and Y are discrete random variables then if they are independent we have P(X =i, Y=j) = P(X=i) P(Y=j) and thus P^{(X,Y)}(i,j) = P^X(i) P^Y(j). That is the distribution of the random variable Z=(X,Y) factorizes into the product of P^X and P^Y.

  • Product spaces. In order to build up more examples we need the so-called Fubini theorem. Given two measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}) we consider the product space (E \times F, \mathcal{E} \otimes \mathcal{F}) where \mathcal{E} \otimes \mathcal{F} is the \sigma-algebra generated by the rectangles A \times B.

  • Measurable functions on product spaces. As we have seen in Exercise 4.6, for any measurable function f: E \times F \to \mathbb{R} the sections g(y)=f(x,y) (for any x) and h(x)=f(x,y) (for any y) are measurable.
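The discrete factorization in the example above can be sketched numerically; the two independent fair dice below are an illustrative choice of ours, not from the text:

```python
from itertools import product

# A sketch of the factorization P^{(X,Y)}(i,j) = P^X(i) P^Y(j):
# build the joint law of Z = (X, Y) as the product of the marginals,
# then check that the marginals of the joint law give back PX and PY.
PX = {i: 1 / 6 for i in range(1, 7)}   # law of a fair die X
PY = {j: 1 / 6 for j in range(1, 7)}   # law of a fair die Y

PXY = {(i, j): PX[i] * PY[j] for i, j in product(PX, PY)}

marg_X = {i: sum(PXY[(i, j)] for j in PY) for i in PX}
marg_Y = {j: sum(PXY[(i, j)] for i in PX) for j in PY}
total = sum(PXY.values())              # total mass of the joint law
```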

Theorem 8.2 (Tonelli-Fubini Theorem) Suppose P is a probability on (E,\mathcal{E}) and Q is a probability on (F,\mathcal{F}).

  1. The set function R defined on rectangles by R(A \times B) = P(A) Q(B) extends to a unique probability measure on \mathcal{E}\otimes \mathcal{F}. This measure is denoted by P \otimes Q and is called the product measure of P and Q.

  2. Suppose f is measurable with respect to \mathcal{E}\otimes \mathcal{F}, either non-negative, or integrable with respect to P \otimes Q. Then the functions x \mapsto \int f(x,y) dQ(y) \quad \text{ and } \quad y \mapsto \int f(x,y) dP(x) are measurable (and, in the integrable case, integrable) with respect to P and Q respectively, and we have \int_{E \times F} f(x,y) d (P\otimes Q) (x,y) = \int_E \left(\int_F f(x,y) dQ(y)\right) dP(x) = \int_F \left(\int_E f(x,y) dP(x)\right) dQ(y)

Proof. For item 1. we need to extend R to arbitrary elements of \mathcal{E}\otimes\mathcal{F}. For C \in\mathcal{E}\otimes\mathcal{F} consider the slice of C along x given by C(x) = \left\{ y \in F\,: (x,y) \in C \right\} \,. If C = A \times B, then C(x)=B for all x\in A and C(x)=\emptyset for x \notin A, and we have then R(C) = P(A)Q(B) = \int_E Q(C(x)) dP(x) Now we define \mathcal{H}=\left\{ C \in \mathcal{E}\otimes\mathcal{F}\,:\, x \to Q(C(x)) \text{ is measurable} \right\} It is not difficult to check that \mathcal{H} is a \sigma-algebra containing all the rectangles A \times B, and therefore \mathcal{H} = \mathcal{E} \otimes \mathcal{F}.
We now define, for any C \in \mathcal{H} = \mathcal{E} \otimes \mathcal{F}, R(C) = \int_E Q(C(x)) dP(x) and we check this is a probability measure. Clearly R(E \times F) = P(E)Q(F)=1. Let C_n \in \mathcal{E} \otimes \mathcal{F} be pairwise disjoint and C = \cup_{n=1}^\infty C_n. Then the slices C_n(x) are pairwise disjoint and by the monotone convergence theorem Q(C(x))=\sum_{n=1}^\infty Q(C_n(x)).

Applying MCT again to the functions g_n(x) = \sum_{k=1}^n Q(C_k(x)) we find that \sum_{n=1}^\infty R(C_n) = \sum_{n=1}^\infty \int_E Q(C_n(x)) dP(x) = \int_E \sum_{n=1}^\infty Q(C_n(x)) dP(x) = \int_E Q(C(x)) dP(x) =R(C)\,, and this shows that R is a probability measure. Uniqueness of R follows from the monotone class theorem.

For item 2. note that in item 1. we have proved it in the case where f(x,y)= 1_C(x,y). By linearity the result then holds for simple functions. If f is nonnegative and measurable then pick an increasing sequence of simple functions such that f_n \nearrow f. Then by MCT \int f(x,y) d P\otimes Q(x,y) = \lim_{n} \int f_n(x,y) d P\otimes Q(x,y) = \lim_{n} \int_E (\int_F f_n(x,y) d Q(y)) dP(x) But the function x \to \int_F f_n(x,y) d Q(y) is increasing in n and, by MCT again, \begin{aligned} \int f(x,y) d P\otimes Q(x,y) & = \int_E \lim_{n} (\int_F f_n(x,y) d Q(y)) dP(x) = \int_E (\int_F \lim_{n} f_n(x,y) d Q(y)) dP(x) \\ &= \int_E (\int_F f(x,y) d Q(y)) dP(x) \end{aligned} Similarly one shows that \int f(x,y) d P\otimes Q(x,y) = \int_F (\int_E f(x,y) d P(x)) dQ(y). The result for integrable f follows by decomposing into positive and negative parts. \quad \square

Applying this to random variables we get

Corollary 8.1 Suppose Z=(X,Y): (\Omega, \mathcal{A}, P) \to (E \times F, \mathcal{E}\otimes \mathcal{F}). Then the random variables X and Y are independent if and only if the distribution P^{(X,Y)} of the pair (X,Y) is equal to P^X \otimes P^Y.

Proof. The random variables X and Y are independent if and only if P((X,Y) \in A \times B)=P(X \in A) P(Y \in B)\,.
This is equivalent to saying that P^{(X,Y)}(A \times B) = P^X(A) P^Y(B) By the uniqueness in the Fubini theorem we have P^{(X,Y)}=P^X \otimes P^Y. \quad \square

8.3 Constructing a probability space for independent random variables

We can construct a probability model for X_1, \cdots, X_n which are independent (real-valued) random variables with given distribution P^{X_1}, \cdots ,P^{X_n}.

By Theorem 5.6 we know how to construct the probability space for each random variable separately, for example, X_i : (\Omega_i,\mathcal{B_i},P_i) \to (\mathbb{R}, \mathcal{B}) where \Omega_i=[0,1], P_i is Lebesgue measure on [0,1] and X_i = Q_i where Q_i is a quantile function for X_i.

We now take \Omega = \prod_{i} \Omega_i = [0,1]^n \,, \quad \mathcal{B}= \otimes_{i=1}^n \mathcal{B_i}, \quad P= \otimes_{i=1}^n P_i and define the map X: \Omega \to \mathbb{R}^n by X=(X_1,\cdots, X_n).

The Fubini-Tonelli theorem shows that P^{X}=P \circ X^{-1} is the distribution of X=(X_1,\cdots, X_n) on \mathbb{R}^n with the product \sigma-algebra and that the random variables X_1, \cdots, X_n are independent.
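A minimal numerical sketch of this construction, with two quantile functions of our choosing (an Exponential(1) and a Uniform[0,2], not from the text):

```python
import math
import random

# On Omega = [0,1]^2 with the product of Lebesgue measures, set
# X_i = Q_i(omega_i) with quantile functions Q_i. Independence of the
# coordinates shows up as factorization of joint probabilities.
random.seed(1)

def Q1(u):
    # quantile function of an Exponential(1) random variable
    return -math.log(1.0 - u)

def Q2(u):
    # quantile function of a Uniform[0, 2] random variable
    return 2.0 * u

N = 100_000
pairs = [(Q1(random.random()), Q2(random.random())) for _ in range(N)]

p_joint = sum(1 for x, y in pairs if x > 1 and y > 1) / N
p_x = sum(1 for x, _ in pairs if x > 1) / N    # ~ e^{-1}
p_y = sum(1 for _, y in pairs if y > 1) / N    # ~ 1/2
```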

Fubini-Tonelli for countably many RV:

We extend this result to countably many independent random variables: this is important in practice where we need such models, for example flipping a coin as many times as needed!

This can be seen as an extension of the Fubini theorem and is also a special case of the so-called Kolmogorov extension theorem which is used to construct general probability measures on infinite product spaces. No proof is given here.

Infinite product \sigma-algebras
Given \sigma-algebras \mathcal{A}_j on \Omega_j we set \Omega = \prod_{j=1}^\infty \Omega_j and define rectangles, for n_1 < n_2 < \cdots < n_k and k finite but arbitrary, by A_{n_1} \times A_{n_2} \times \cdots \times A_{n_k} = \{ \omega=(\omega_1,\omega_2,\omega_3, \cdots) \in \Omega \;: \omega_{n_j} \in A_{n_j}\} where A_{n_j} \in \mathcal{A}_{n_j}. The product \sigma-algebra \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j is the \sigma-algebra generated by all the rectangles.

Theorem 8.3 Given probability spaces (\Omega_j,\mathcal{A}_j,P_j), with \Omega = \prod_{j=1}^\infty \Omega_j and \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j, there exists a unique probability P on (\Omega,\mathcal{A}) such that P(A_{n_1} \times \cdots \times A_{n_k}) = \prod_{j=1}^k P_{n_j}(A_{n_j}) for all A_{n_j}\in \mathcal{A}_{n_j}, all n_1< \cdots < n_k and all k.

If we have a RV X_n: \Omega_n \to \mathbb{R} then we define its extension to \Omega by \tilde{X_n}(\omega) = X_n(\omega_n) and the distribution of \tilde{X_n} is the same as the distribution of X_n because \tilde{X_n}^{-1}(B_n) = \Omega_1 \times \cdots \times \Omega_{n-1} \times X_n^{-1}(B_n) \times \Omega_{n+1} \times \cdots and thus P(\tilde{X_n}\in B_n) =P_n (X_n \in B_n)\,. A similar computation shows that \tilde{X}_n and \tilde{X}_m are independent for n \not= m, and more generally that any finite collection of the \tilde{X}_j's is independent.

8.4 Kolmogorov zero-one law

We consider (X_n)_{n=1}^\infty to be RVs defined on some probability space \Omega. We may think of n as “time” and we consider the following \sigma-algebras
\begin{aligned} \mathcal{B}_n & = \sigma(X_n) & \text{the $\sigma$-algebra generated by } X_n \\ \mathcal{C}_n & = \sigma\left( \cup_{p \ge n} \mathcal{B}_p\right) & \text{the $\sigma$-algebra describing the "future" after time } n \\ \mathcal{C}_\infty & = \cap_{n=1}^\infty \mathcal{C}_n & \text{the "tail" $\sigma$-algebra or $\sigma$-algebra "at infinity"} \end{aligned}

Theorem 8.4 (Zero-one law) Suppose X_n is a sequence of independent random variables and let \mathcal{C_\infty} be the corresponding tail \sigma-algebra. Then we have C \in \mathcal{C}_\infty\implies P(C)=0 \text{ or } P(C)=1

Proof. The \sigma-algebras \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_n\} are independent and therefore \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_\infty\} are independent for every n since \mathcal{C}_\infty \subset \mathcal{C}_n. Therefore \mathcal{C}_0=\sigma\left( \cup_{n \ge 1} \mathcal{B}_n\right) and \mathcal{C}_\infty are independent (by a monotone class argument). So for A \in \mathcal{C}_0 and B \in \mathcal{C}_\infty we have P(A \cap B)=P(A) P(B). This also holds for A=B \in \mathcal{C}_\infty since \mathcal{C}_\infty \subset \mathcal{C}_0. Therefore we have P(A)=P(A)^2, which is possible only if P(A)=0 or 1. \quad \square

Examples Given independent random variable X_1, X_2, \cdots we define S_n = X_1 + X_2 +\cdots + X_n.

  • The event \{ \omega\,:\, \lim_{n} X_n(\omega) \text{ exists }\} belongs to every \mathcal{C}_n and thus belongs to \mathcal{C}_\infty. Therefore X_n either converges a.s. or diverges a.s.

  • A random variable which is measurable with respect to \mathcal{C}_\infty must be constant almost surely. Therefore \limsup_n X_n \,, \quad \liminf_n X_n\,, \quad \limsup_n \frac{1}{n} S_n\,, \quad \liminf_n \frac{1}{n} S_n are all constant almost surely.

  • The event \limsup_{n} \{ X_n \in B\} (also called \{ X_n \in B \text{ infinitely often }\}) is in \mathcal{C}_\infty.

  • The event \limsup_{n} \{ S_n \in B\} is not in \mathcal{C}_\infty.

8.5 Exercises

Exercise 8.1  

  • Suppose X\ge 0 is a non-negative random variable and p > 0. Show that E[X^p]= \int_0^\infty p t^{p-1} (1- F(t)) dt = \int_0^\infty p t^{p-1} P(X>t) dt In particular E[X]=\int_0^\infty P(X>t) dt.
    Hint: Use Fubini on the product measure P times Lebesgue measure on [0,\infty).

  • Deduce from this that if X takes non-negative integer values we have E[X]= \sum_{n >0} P( X>n)\,, \quad E[X^2]= 2 \sum_{n >0} n P( X>n) + E[X] \,.


Exercise 8.2 Find three random variables X, Y, and Z taking values in \{-1,1\} which are pairwise independent but are not independent.

Exercise 8.3  

  • A random variable X is a Bernoulli RV if X takes only values 0 or 1, and then E[X]=P(X=1) (alternatively you can think of X(\omega)=1_A(\omega) for some measurable set A). Show that Y=1-X is also Bernoulli and that the product of two Bernoulli random variables is again a Bernoulli RV (no independence required).

  • Suppose X_1, X_2, \cdots, X_n are Bernoulli random variables on some probability space (\Omega,\mathcal{A},P) (they are not assumed to be independent) and let Y_k = 1-X_k. Show that P(X_1=0, X_2=0, \cdots, X_n=0) =E\left[ \prod_{i=1}^n Y_i\right] and P(X_1=0, \cdots, X_k =0,X_{k+1}=1, \cdots, X_n=1) =E \left[ \prod_{i=1}^k Y_i \prod_{j=k+1}^n X_j\right]

  • Show that the Bernoulli random variables X_1, \cdots, X_n are independent if and only if E[\prod_{j\in J} X_j]= \prod_{j\in J}E[X_j] for all subsets J of \{1,\cdots, n\}.

Exercise 8.4 Consider the probability space ([0,1), \mathcal{B}, P) where \mathcal B is the Borel \sigma-algebra and P is the uniform distribution on [0,1). Expand each number \omega in [0,1) in dyadic expansion (or bits) \omega = \sum_{n=1}^\infty \frac{d_n(\omega)}{2^n} =0. d_1(\omega)d_2(\omega)d_3(\omega) \cdots \quad \quad \textrm{ with } \quad d_n(\omega) \in \{0,1\}

  • For certain numbers the dyadic expansion is not unique. For example we have \frac{1}{2} = 0.100000\cdots = \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots = 0.01111111111\cdots Show that P almost surely a number \omega has a unique dyadic expansion.

  • Prove that each d_n(\omega) is a random variable and that they are identically distributed (i.e. each d_n has the same distribution).

  • Prove that the d_n, n=1,2,3, \cdots are a collection of independent and identically distributed random variables.

Remark: this problem shows that you can think of the Lebesgue measure on [0,1) as the infinite product measure of independent Bernoulli trials.
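The remark can be illustrated numerically: extracting the first dyadic digits of uniform samples produces what looks exactly like independent fair coin flips (a sanity check, not a proof of the exercise):

```python
import random

# The first k dyadic digits d_1,...,d_k of omega in [0,1) are computed by
# repeated doubling; for omega uniform they should be iid Bernoulli(1/2).
random.seed(2)

def bits(omega, k):
    # first k dyadic digits of omega in [0, 1)
    out = []
    for _ in range(k):
        omega *= 2
        d = int(omega)
        out.append(d)
        omega -= d
    return out

N = 100_000
samples = [bits(random.random(), 3) for _ in range(N)]

# Each digit is ~ Bernoulli(1/2) and all 8 patterns of (d_1, d_2, d_3)
# are (approximately) equally likely, as for 3 independent flips.
freq_d1 = sum(b[0] for b in samples) / N
freq_pattern = sum(1 for b in samples if b == [1, 0, 1]) / N
```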

Exercise 8.5 Suppose X and Y are RVs with finite variance (or equivalently X,Y\in L^2). The covariance of X and Y is defined by {\rm Cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY]- E[X]E[Y]

  1. Show that {\rm Cov}(X,Y) is well defined and bounded by \sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}.

  2. The correlation coefficient \rho(X,Y)=\frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}} measures the correlation between X and Y. Given a number \alpha \in [-1,1] find two random variables X and Y such that \rho(X,Y)=\alpha.

  3. Show that {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y) + 2{\rm Cov}(X,Y).

  4. Show that if X and Y are independent then {\rm Cov}(X,Y)=0 and so {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y).

  5. The converse of statement 4. is in general not true, that is {\rm Cov}(X,Y)=0 does not imply that X and Y are independent. Hint: For example take X to be standard normal, Z discrete with P(Z=\pm 1)=\frac12 and independent of X (Z is sometimes called a Rademacher RV), and Y=ZX.

  6. Suppose X_1, \cdots, X_n are independent RVs with E[X_j]=\mu and {\rm Var}(X_j)=\sigma^2< \infty for all j. The sample mean \overline{X} and sample variance S^2 are given by \overline{X}=\frac{1}{n}\sum_{j=1}^n X_j \quad \quad S^2= \frac{1}{n-1}\sum_{j=1}^n(X_j -\overline{X})^2 Show that E[\overline{X}]=\mu, {\rm Var}(\overline{X})=\frac{\sigma^2}{n} and E[S^2]=\sigma^2.

9 Probability kernels and measures on product spaces

9.1 Probability kernels

  • How do we build “general” measures on some product space E \times F (e.g. \mathbb{R}^2 or \mathbb{R}^n)?

Definition 9.1 Given 2 measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}) a probability kernel K(x,B) from (E,\mathcal{E}) into (F,\mathcal{F}) (also often called a Markov kernel) is a map K: E \times \mathcal{F} \to \mathbb{R} such that

  1. For any B\in \mathcal{F} the map x \to K(x,B) is a measurable map.

  2. For any x\in E the map B \to K(x,B) is a probability measure on (F,\mathcal{F}).

Examples

  • If E and F are countable sets then a kernel is built from a transition matrix K(i,j), i \in E and j \in F such that K(i,j) \ge 0 \text{ and } \sum_{j \in F} K(i,j) =1 We have then K(i,B)= \sum_{j \in B} K(i,j).

  • Suppose (F,\mathcal{F})=(\mathbb{R},\mathcal{B}); then a kernel is built from a “conditional density” k(x,y), where x \in E and y\in \mathbb{R}, such that k is measurable and \int_\mathbb{R} k(x,y) dy=1 for all x (see examples below).
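For countable spaces the kernel is literally a row-stochastic matrix. A small sketch (the numbers below are an arbitrary illustrative choice) of the measure R(A \times B) = \int_A K(x,B)\,dP(x) built from a kernel:

```python
# A probability kernel between the finite sets E = F = {0, 1}:
# each row K[i] is a probability on F, and
# R(A x B) = sum over i in A of P(i) K(i, B), a finite version of the
# integral formula of Theorem 9.1.
P = {0: 0.3, 1: 0.7}                    # probability on E
K = {0: {0: 0.5, 1: 0.5},               # K(0, .): a probability on F
     1: {0: 0.1, 1: 0.9}}               # K(1, .): a probability on F

def R(A, B):
    # R(A x B) = integral over A of K(x, B) dP(x), here a finite sum
    return sum(P[i] * sum(K[i][j] for j in B) for i in A)

total = R({0, 1}, {0, 1})                      # total mass, should be 1
marg_F = {j: R({0, 1}, {j}) for j in (0, 1)}   # second marginal Q(B)
```

Note that the first marginal of R is P itself, while the second marginal mixes the rows of K, exactly as in the discussion of marginals below.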

The following theorem is a generalization of the Fubini-Tonelli theorem.

Theorem 9.1 Let P be a probability measure on (E,\mathcal{E}) and K(x,B) a probability kernel from (E,\mathcal{E}) into (F, \mathcal{F}). Then a probability measure R on (E \times F, \mathcal{E}\otimes \mathcal{F}) is defined by \int f(x,y) dR(x,y) = \int_E \left(\int_F f(x,y) K(x,dy)\right) dP(x) for any measurable non-negative f. In particular we have for any A \in \mathcal{E} and B \in \mathcal{F} R( A \times B) = \int 1_A(x) 1_B(y) dR(x,y) = \int 1_A(x) K(x,B) dP(x) = \int_A K(x,B) dP(x)\,. We write R(dx,dy)=P(dx)K(x,dy)

Proof. The proof is very similar to Fubini-Tonelli theorem and is thus omitted.

  • This theorem is intimately related to the concepts of conditional probability and conditional expectations which we will study later.

  • Roughly speaking, on nice probability spaces (e.g. separable metric spaces), every probability measure on the product space E \times F can be constructed in the way described in Theorem 9.1 (this is a deep result).

Definition 9.2 (marginals of a probability measure) Given a probability measure R on the product space (E \times F, \mathcal{E}\otimes \mathcal{F}), the marginals of R on E and F are defined to be the measures given by \begin{aligned} P(A) &= R(A \times F)\, \quad A \in \mathcal{E} \\ Q(B) &= R(E \times B)\, \quad B \in \mathcal{F} \end{aligned} Alternatively we can think of the marginals as the image measures P = R \circ \pi_E^{-1}, \quad \quad Q = R \circ \pi_F^{-1} where \pi_E and \pi_F are the projection maps \pi_E(x,y) = x and \pi_F(x,y) = y

  • If R=P \otimes Q is a product measure then the marginals of R are P and Q. This corresponds to the constant kernel K(x,B)=Q(B).

  • If R(dx,dy)= P(dx) K(x,dy) then we have R(A \times F) = \int_A K(x,F) dP(x) = P(A) \quad \quad R(E \times B) = \int_E K(x,B) dP(x) so its marginals are P and Q, where Q is given by Q(B) = \int_E K(x,B) dP(x).

9.2 Lebesgue measure on \mathbb{R}^n and densities

In a previous chapter we constructed the Lebesgue probability measure P_0 on [0,1] as the unique measure such that P_0[(a,b]]=b-a. Using this and other uniformly distributed random variables we define the Lebesgue measure on \mathbb{R} and on \mathbb{R}^n.

Definition 9.3 The Lebesgue measure on \mathbb{R} is a set function m such that

  1. m(\emptyset)=0.

  2. m\left(\bigcup_{i=1}^\infty A_i \right)=\sum_{i=1}^\infty m(A_i) for any pairwise disjoint sets A_i.

  3. m((a,b])=b-a for any a < b.

The Lebesgue measure on \mathbb{R} is not a probability measure since m(\mathbb{R})=+\infty; it is an example of an infinite measure. We can easily construct it using uniform random variables on (n,n+1] with distribution P_{n,n+1}, namely we set
m(A) = \sum_{n=-\infty}^\infty P_{n,n+1}(A) The uniqueness of the measure P_{n,n+1} implies that the measure m is unique as well.

By a Fubini-Tonelli type argument one can construct the Lebesgue measure on \mathbb{R}^n.

Definition 9.4 If we equip \mathbb{R}^n with the product \sigma-algebra \mathcal{B}\otimes \cdots \otimes \mathcal{B}, the Lebesgue measure m_n on \mathbb{R}^n is the product of n Lebesgue measures on \mathbb{R}. We have m_n\left( \prod_{i=1}^n [a_i,b_i]\right)= \prod_{i=1}^n (b_i-a_i)


Notation: we often use the notation dx or dx_1 \cdots dx_n for integration with respect to m_n.


Definition 9.5 A probability measure on (\mathbb{R}^n,\mathcal{B}_n) (where \mathcal{B}_n=\mathcal{B} \otimes \cdots \otimes \mathcal{B}) has a density f if f is a nonnegative Borel measurable function and P(A)\,=\, \int_A f(x) dx = \int 1_A(x) f(x) dx = \int f(x_1, \cdots, x_n) 1_A(x_1, \cdots, x_n) dx_1 \cdots dx_n

Theorem 9.2 A non-negative Borel measurable function f is the density of a Borel probability measure if and only if \int f(x) dx =1, and f determines the probability P. Conversely the probability measure determines its density (if it exists!) up to a set of Lebesgue measure 0.

Proof. Given f\ge 0 with \int f(x) dx=1 it is easy to check that P(A) = \int 1_A f(x) dx defines a probability measure (same proof as in Exercise 7.3).
Conversely assume that f and f' are two densities for the measure P; then for any measurable set A we have P(A)=\int_A f(x) dx = \int_A f'(x) dx. Consider now the set A_n =\left\{ x : f'(x)\ge f(x)+ \frac1n\right\} Then we have P(A_n)= \int_{A_n} f'(x) dx \ge \int_{A_n} \left(f(x) + \frac1n\right) dx = P(A_n) + \frac1n m(A_n) and therefore m(A_n)=0, and since A_n increases to \{f < f'\} we have shown that m(\{f < f'\})=0. By symmetry we have f=f' a.e. \quad \square

Theorem 9.3 Suppose the random variable (X,Y) has a probability distribution R(dx,dy) with density f(x,y). Then

  1. Both X and Y have densities given respectively by f_X(x)=\int f(x,y) dy\,,\quad \quad f_Y(y)=\int f(x,y) dx \,.

  2. X and Y are independent if and only if f(x,y) = f_X(x)f_Y(y)\,.

  3. If we set k(x,y)= \frac{f(x,y)}{f_X(x)} \text{ whenever } f_X(x) \not= 0, this defines a kernel K(x,B)=\int_B k(x,y)dy and the probability distribution of (X,Y) is given by R(dx,dy)=f_X(x)k(x,y)dxdy.

Remark It does not matter how K(x,B) is defined for those x where f_X(x)=0, so we have left it undefined. In general there are many kernel densities which give the same probability, allowing for changes on sets of zero probability.

Proof.

  1. For A \in \mathcal{B} we have P(X\in A) = P(X \in A, Y \in \mathbb{R}) = \int_A \left(\int_\mathbb{R} f(x,y) dy\right) dx = \int_A f_X(x) dx and since this holds for all A, f_X(x) is a density for the distribution of X.

  2. If f(x,y)=f_X(x) f_Y(y) then \begin{aligned} P(X \in A, Y \in B)&= \int 1_{A \times B}(x,y) f_X(x) f_Y(y) dx dy = \int 1_{A}(x) f_X(x) dx \int 1_{B}(y) f_Y(y) dy \\ &=P(X\in A)P(Y\in B) \end{aligned} Conversely assume that X and Y are independent. Consider the collection of sets \mathcal{H} =\left\{ C \in \mathcal{B}\otimes \mathcal{B} \,:\, \int_C f(x,y) dx dy = \int_C f_X(x) f_Y(y) dx dy\right\} Independence and the Fubini theorem imply that any set C = A \times B belongs to \mathcal{H}. Since the rectangles form a p-system generating the \sigma-algebra, the monotone class theorem shows that \mathcal{H}=\mathcal{B}\otimes \mathcal{B}.

  3. We have \int k(x,y) dy = \int \frac{f(x,y)}{f_X(x)} dy = \frac{f_X(x)}{f_X(x)} =1 and thus k(x,\cdot) is a density. Furthermore we have \begin{aligned} P( X \in A, Y \in B) &= \int_A K(x,B) dP(x) = \int_A f_X(x) \left(\int_B k(x,y) dy \right) dx = \int_A f_X(x) \left(\int_B \frac{f(x,y)}{f_X(x)} dy \right) dx \\ & = \int_A \int_B f(x,y) dx dy \end{aligned} and this concludes the proof since the measure is uniquely determined by its values on rectangles (by a monotone class theorem argument).

9.3 Example: the Box-Muller algorithm

We derive here a method to generate two independent normal random variables using two independent random numbers. This is a different algorithm than the one using the quantile of the normal RV, which is known only numerically.

Theorem 9.4 Suppose that U_1 and U_2 are two independent random numbers; then \begin{aligned} X_1 = \sqrt{-2\ln(U_1)} \cos( 2 \pi U_2) \quad \quad \quad X_2 = \sqrt{-2\ln(U_1)} \sin( 2 \pi U_2) \end{aligned} are two independent normal random variables with mean 0 and variance 1.

Proof. We use the expectation rule Theorem 6.8 together with polar coordinates. For any nonnegative function h(x_1,x_2) we have, using polar coordinates x_1 =r \cos\theta and x_2=r \sin \theta and then the change of variable s=r^2/2, \begin{aligned} E[ h(X_1,X_2)] & = \int h(x_1,x_2) f(x_1,x_2) dx_1 dx_2 = \int_{\mathbb{R}^2} h(x_1,x_2) \frac{1}{2\pi} e^{-\frac{\|x\|^2}{2}} dx_1 dx_2 \\ & = \int_{[0,\infty)\times[0,2\pi]} h(r \cos\theta,r \sin\theta) \frac{1}{2\pi}d\theta\, r e^{-\frac{r^2}{2}}dr \\ & = \int_{[0,\infty)\times[0,2\pi]} h(\sqrt{2s} \cos\theta, \sqrt{2s} \sin\theta) \frac{1}{2\pi}d\theta\, e^{-s} ds \end{aligned} This computation shows that if S is exponential with parameter 1 and \Theta is uniform on [0,2\pi] then \sqrt{2 S}\cos(\Theta) and \sqrt{2 S}\sin(\Theta) are independent standard normal. But we can write also S=-\ln(U_1) and \Theta=2\pi U_2. \quad \square
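Theorem 9.4 translates directly into a sampler. The sketch below (with the sine in the second coordinate, as the proof requires) checks empirically that the output has mean 0, variance 1 and uncorrelated coordinates:

```python
import math
import random

# Box-Muller: turn two independent uniforms into two independent N(0,1).
random.seed(3)

def box_muller():
    u1 = 1.0 - random.random()   # in (0, 1], avoids log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

N = 200_000
xs, ys = zip(*(box_muller() for _ in range(N)))

mean_x = sum(xs) / N                              # ~ 0
var_x = sum(x * x for x in xs) / N                # ~ 1
corr = sum(x * y for x, y in zip(xs, ys)) / N     # ~ 0 (uncorrelated)
```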

9.4 Exponential mixture of exponentials is polynomial

Let us consider an exponential random variable Y whose parameter is itself an exponential random variable X with parameter \lambda>0. That is X has density f(x)= \left\{ \begin{array}{cl} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{else} \end{array} \right. and consider the kernel K(x,dy)=k(x,y)dy with k(x,y) = \left\{ \begin{array}{cl} xe^{-xy} & x>0, y>0 \\ 0 & \text{ else } \end{array} \right. Then (X,Y) has density f(x,y)= \left\{ \begin{array}{cl} \lambda e^{-\lambda x} x e^{-xy} = \lambda x e^{-(\lambda +y)x} & x>0, y>0 \\ 0 & \text{ else } \end{array} \right. Then, using that the mean of an exponential RV is the reciprocal of the parameter, the density of Y is
f(y) = \int_0^\infty f(x,y) dx = \frac{\lambda}{\lambda +y} \int_0^\infty x (\lambda +y) e^{-(\lambda +y)x} dx = \frac{\lambda}{(\lambda+y)^2} which decays polynomially!
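This two-stage mixture is easy to simulate. Integrating the density above gives the tail P(Y > t) = \lambda/(\lambda+t), which the sketch below (with \lambda = 1, our choice) checks at t = 1 and t = 9:

```python
import random

# Two-stage sampling of the mixture: first X ~ Exp(lambda), then Y ~ Exp(X)
# using the kernel k(x, y) = x e^{-xy}.
lam = 1.0
random.seed(4)

N = 200_000
ys = []
for _ in range(N):
    x = random.expovariate(lam)        # rate-lambda exponential
    ys.append(random.expovariate(x))   # exponential with the random rate x

# Empirical tails versus the exact value lambda/(lambda + t):
tail1 = sum(1 for y in ys if y > 1) / N   # exact value 1/2
tail9 = sum(1 for y in ys if y > 9) / N   # exact value 1/10: a heavy tail
```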

9.5 Gamma and beta random variables

Recall that a gamma RV with parameters (\alpha,\beta) has density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x} for x \ge 0.

Consider now two independent gamma RVs X_1 and X_2 with parameters (\alpha_1, \beta) and (\alpha_2, \beta). We prove the following facts

  1. Z=X_1 + X_2 is a gamma RV with parameters (\alpha_1+\alpha_2, \beta)

  2. U=\frac{X_1}{X_1+X_2} is a beta distribution with parameters \alpha_1 and \alpha_2, which has the density f(u) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} \quad 0 \le u \le 1

  3. X_1 + X_2 and \frac{X_1}{X_1+X_2} are independent RV

We use the expectation rule Theorem 6.8 and the change of variables z= x_1+x_2 and u= \frac{x_1}{x_1+x_2}, or x_1= uz and x_2=(1-u)z. This maps [0,\infty) \times [0,\infty) to [0,1] \times [0,\infty) and the Jacobian of this transformation equals z.

We have then for any nonnegative h \begin{aligned} E[h(Z,U)]&=E\left[h\left(X_1+X_2,\frac{X_1}{X_1+X_2}\right)\right]\\ &= \int_0^\infty \int_0^\infty h\left(x_1+x_2,\frac{x_1}{x_1+x_2}\right) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} x_1^{\alpha_1-1} x_2^{\alpha_2 -1} e^{- \beta(x_1 + x_2)} dx_1 dx_2 \\ & = \int_0^1 \int_0^\infty h(z,u) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} (uz)^{\alpha_1-1} ((1-u)z)^{\alpha_2-1} e^{-\beta z} z\, dz\, du \\ &= \int_0^1 \int_0^\infty h(z,u) \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1+ \alpha_2)} z^{\alpha_1 + \alpha_2 -1} e^{-\beta z}\, dz\, du \end{aligned} and this proves all three statements at once, since the joint density of (Z,U) factorizes into a beta density in u times a gamma density in z.

Remark This is a nice, indirect way to compute the normalization for the density of the beta distribution, which is proportional to u^{\alpha_1-1}(1-u)^{\alpha_2-1}.
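All three facts can be sanity-checked by simulation. The parameter values below are an arbitrary choice, and note that Python's random.gammavariate takes a shape and a scale, i.e. scale = 1/\beta:

```python
import random

# Monte-Carlo check of facts 1.-3. with alpha_1 = 2, alpha_2 = 3, beta = 1
# (so the scale passed to gammavariate is 1/beta = 1).
a1, a2 = 2.0, 3.0
random.seed(5)

N = 100_000
data = []
for _ in range(N):
    x1 = random.gammavariate(a1, 1.0)
    x2 = random.gammavariate(a2, 1.0)
    data.append((x1 + x2, x1 / (x1 + x2)))

mean_Z = sum(z for z, _ in data) / N   # gamma(a1+a2, 1) has mean a1 + a2
mean_U = sum(u for _, u in data) / N   # beta(a1, a2) has mean a1/(a1+a2)
cov_ZU = sum(z * u for z, u in data) / N - mean_Z * mean_U   # ~ 0
```

The vanishing empirical covariance is consistent with (though of course weaker than) the independence of Z and U proved above.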

9.6 The beta-binomial model and a whiff of Bayesian statistics

A beta random variable U has mean E[U] = \int_0^1 u \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} du = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1 +1) \Gamma(\alpha_2)}{\Gamma(\alpha_1 +\alpha_2 +1)} = \frac{\alpha_1}{\alpha_1+\alpha_2} and proceeding similarly one finds that {\rm Var}(U)= \frac{\alpha_1 \alpha_2}{(\alpha_1+\alpha_2)^2 (\alpha_1 + \alpha_2+1)}
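Both formulas can be verified numerically by integrating against the beta density; the parameter values below are an illustrative choice of ours:

```python
import math

# Check the beta mean and variance formulas for alpha_1 = 2.5, alpha_2 = 4
# by a midpoint-rule integration against the beta density.
a1, a2 = 2.5, 4.0
logC = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)

def density(u):
    # beta(a1, a2) density on (0, 1)
    return math.exp(logC) * u ** (a1 - 1) * (1 - u) ** (a2 - 1)

# Midpoint rule on [0, 1]; adequate since the density is bounded for a1, a2 > 1.
n = 200_000
us = [(k + 0.5) / n for k in range(n)]
mean_U = sum(u * density(u) for u in us) / n
second = sum(u * u * density(u) for u in us) / n
var_U = second - mean_U ** 2
```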

There is a natural connection between a binomial random variable (discrete) and the beta random variable (continuous) (let us call it P for a change). The pdfs look strangely similar: {n \choose j} p^j (1-p)^{n-j} \quad \text{ versus } \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 -1} But p is a parameter for the first while it is the variable for the second!

Model:

  • Make n independent trials, each with a (random) probability of success P.

  • Take P to have a beta distribution with suitably chosen parameters \alpha_1,\alpha_2.

  • The mean \frac{\alpha_1}{\alpha_1+\alpha_2} is your average guess for the “true” probability p and by adjusting the scale you can adjust the variance (uncertainty) associated with your guess.

This leads to considering a random variable (X,P) taking values in \{0,1,\cdots,n\} \times [0,1] with a “density” f(j,p) = \underbrace{{n \choose j} p^j (1-p)^{n-j}}_{=k(p,j) \text{ kernel }} \underbrace{\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1-1} (1-p)^{\alpha_2-1} }_{=f(p)} which is normalized: \sum_{j=0}^n \int_0^1 f(j,p) dp =1

The beta-binomial distribution with parameters (n,\alpha_1,\alpha_2) is the marginal distribution of X on \{0,1,\cdots,n\} given by \begin{aligned} f(j) = \int_0^1 f(j,p) dp & = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \int_0^1 p^{\alpha_1+j -1}(1-p)^{\alpha_2+n-j-1} dp \\ & = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1+j) \Gamma(\alpha_2+n-j)}{\Gamma(\alpha_1+\alpha_2+n)} \end{aligned}
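For numerical work the marginal is best evaluated through log-gamma functions. The sketch below checks that the pmf sums to 1 and that the uniform prior \alpha_1=\alpha_2=1 makes the evidence uniform on \{0,\dots,n\}:

```python
import math

# The beta-binomial pmf, re-expressing the marginal f(j) derived above
# with math.lgamma for numerical stability.
def beta_binomial_pmf(j, n, a1, a2):
    log_binom = (math.lgamma(n + 1) - math.lgamma(j + 1)
                 - math.lgamma(n - j + 1))
    log_betas = (math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
                 + math.lgamma(a1 + j) + math.lgamma(a2 + n - j)
                 - math.lgamma(a1 + a2 + n))
    return math.exp(log_binom + log_betas)

n = 10
total = sum(beta_binomial_pmf(j, n, 2.0, 3.0) for j in range(n + 1))

# With the uniform prior a1 = a2 = 1 the evidence is uniform on {0,...,n}:
uniform_val = beta_binomial_pmf(4, n, 1.0, 1.0)   # equals 1/(n+1) = 1/11
```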

Bayesian statistics framework

  • We interpret the distribution f(p) with parameters \alpha_1 and \alpha_2 as the prior distribution which describes our beliefs before we do any experiment. For example \alpha_1=\alpha_2=1 corresponds to a uniform distribution on p (that is, we are totally agnostic, a fine choice if you know nothing).

  • The beta-binomial distribution, which is the marginal f(j)=\int_0^1 f(j,p) dp, describes the distribution of the independent trials under this model. It is called the evidence.

  • The kernel k(p,j) describes the probability of j successes given a certain probability of success p. It is called the likelihood when we view it as a function of p and think of j as a parameter.

  • Now we can write the distribution using kernels in two ways: f(j,p)= k(p,j)f(p) = k(j,p) f(j) and the kernel k(j,p) is called the posterior distribution. It is interpreted as the distribution of p given that j successes have been observed.

  • We can rewrite this (this is just a version of Bayes theorem) as \text{posterior} = k(j,p) = \frac{k(p,j)f(p)}{f(j)} = \frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\,.

  • For the particular model at hand k(j,p) \propto p^{j} (1-p)^{n-j} p^{\alpha_1-1} (1-p)^{\alpha_2-1} \propto p^{\alpha_1+j-1} (1-p)^{\alpha_2+ n- j-1} and therefore k(j,p) is a beta distribution with parameters \alpha_1 +j and \alpha_2+n-j.

  • This is special: the prior and posterior distributions belong to the same family. We say that we have conjugate priors and this is a simple model to play with. In general we need Markov chain Monte Carlo methods to do the job.

  • Example: Amazon seller ratings. You want to buy a book online.

Vendor 1 has 151 positive ratings and 7 negative ratings (95.5%).
Vendor 2 has 946 positive ratings and 52 negative ratings (94.7%).
A uniform prior with \alpha_1=\alpha_2=1 gives two beta posteriors, one with \alpha_1=152, \alpha_2=8 and one with \alpha_1=947, \alpha_2=53.
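
A natural way to compare the two vendors is to sample from the two posteriors. The sketch below (not from the text; the sample size and seed are arbitrary choices) uses the stdlib `random.betavariate` to estimate the posterior probability that vendor 1’s true rating exceeds vendor 2’s.

```python
import random

random.seed(0)

# Posterior Beta parameters from a uniform prior (alpha1 = alpha2 = 1)
post1 = (1 + 151, 1 + 7)     # vendor 1: 151 positive, 7 negative
post2 = (1 + 946, 1 + 52)    # vendor 2: 946 positive, 52 negative

mean1 = post1[0] / sum(post1)    # 152/160 = 0.95
mean2 = post2[0] / sum(post2)    # 947/1000 = 0.947

# Monte Carlo estimate of P(p1 > p2) under the two posteriors
N = 100_000
wins = sum(random.betavariate(*post1) > random.betavariate(*post2)
           for _ in range(N))
prob_vendor1_better = wins / N   # a bit above 1/2: the evidence is thin
```

Note that even though vendor 1 has the higher posterior mean, the probability that it is genuinely better is only modestly above one half, because its posterior is much wider (only 158 ratings versus 998).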

9.7 Exercises

Exercise 9.1  

  1. Suppose that X is a gamma RV with parameters \alpha and \beta and Z is an exponential RV with parameter 1, and suppose further that Z and X are independent. Find the CDF (and then the PDF) of Y=\frac{Z}{X}.

  2. Proceeding as in Section 9.4 consider an exponential RV Y whose parameter X has a gamma distribution with parameters \alpha and \beta. Find the marginal distribution of Y.

  3. Compare 1. and 2. and explain why they give the same result.

Exercise 9.2 (Poisson-Gamma model) Consider a Poisson RV X with a random parameter \Lambda which itself has a gamma distribution (for some parameters \alpha,\beta).

  1. What are the joint density and the kernel, f(j,\lambda)=k(\lambda,j) f(\lambda), for the pair (X,\Lambda)?

  2. What is the density f(j) of X (the “evidence”)?

  3. What is the “posterior distribution” (that is what is the kernel k(j,\lambda) if we write f(j,\lambda)= k(j,\lambda)f(j))?

10 The characteristic and moment generating function for a random variable

One of the oldest tricks in the mathematician’s toolbox is to obtain properties of a mathematical object by performing a transformation on that object to map it into another space. In analysis (say for ODE’s and PDE’s) the Fourier transform and the Laplace transform play a very important role. Both play an equally important role in probability theory!

  • The Fourier transform of a probability measure leads to a proof of the central limit theorem!

  • The Laplace transform (via Chernoff bounds) leads to concentration inequalities and performance guarantees for Monte-Carlo methods and statistical learning.

10.1 Fourier transform and characteristic function

Notation For vectors x,y \in \mathbb{R}^n we use the notation \langle x, y\rangle = \sum_{i=1}^n x_i y_i for the scalar product in \mathbb{R}^n

Definition 10.1 (Fourier transform and characteristic function)  

  • For a probability measure P on \mathbb{R}^n the Fourier transform of P is a function \widehat{P}(t): \mathbb{R}^n \to \mathbb{C} given by
    \widehat{P}(t) = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP(x)

  • For a random variable X taking values in \mathbb{R}^n the characteristic function of X is the Fourier transform of P^X (the distribution of X): we have \phi_X(t)= E\left[e^{i\langle t,X \rangle}\right] = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP^X(x)

Remarks

  • We have not talked explicitly about integration of complex valued functions h=f+ig where f and g are the real and imaginary parts. It is simply defined as \int (f+ig)dP = \int f dP + i \int g dP provided f and g are integrable. A complex function h is integrable if and only if |h| is integrable, if and only if f and g are integrable. (The only thing a bit hard to prove is the triangle inequality |\int h dP|\le \int|h| dP.)

  • The function e^{i\langle t,x \rangle} = \cos(\langle t,x \rangle) + i \sin (\langle t,x \rangle) is integrable since \sin(\langle t,x \rangle) and \cos(\langle t,x \rangle) are bounded functions (thus in L^\infty \subset L^1 since P is a probability measure) or by noting that |e^{i\langle t,x \rangle}|=1.

  • Suppose the measure P has a density f(x); then we have \widehat{P}(t)= \int e^{i\langle t,x \rangle} f(x) dx which is simply the Fourier transform (usually denoted by \widehat{f}(t)) of the function f(x). (Diverse conventions are used for the Fourier transform, e.g. using e^{-i 2\pi\langle k,x \rangle} instead of e^{i\langle t,x \rangle}.)

10.2 Analytic properties of the Fourier transform

We turn next to the properties of the Fourier transform. A very useful heuristic to remember about the Fourier transform:

\text{ the faster the measure decays, the smoother its Fourier transform is, and vice versa}

The next two theorems make this explicit in the context of measures. The first one is very general and uses only the fact that we are dealing with a probability measure.

Theorem 10.1 The Fourier transform \widehat{P}(t) of a probability measure P is uniformly continuous on \mathbb{R}^n, and satisfies \widehat{P}(0)=1 and |\widehat{P}(t)|\le 1.

Proof. Clearly \widehat{P}(0)= \int_{\mathbb{R}^n} dP(x)=1 since P is a probability measure and since |e^{i\langle t,x \rangle}|=1, by the triangle inequality |\widehat{P}(t)|\le 1.

For the uniform continuity we have \begin{aligned} |\widehat{P}(t+h) -\widehat{P}(t)| \le \int |e^{i\langle (t+h),x \rangle} - e^{i\langle t,x \rangle} | dP(x) = \int |e^{i\langle t,x \rangle}| | e^{i\langle h,x \rangle} - 1 | dP(x) = \int | e^{i\langle h,x \rangle} - 1 | dP(x) \end{aligned} The right hand side is independent of t, which gives uniformity. To conclude we need to show that the right hand side goes to 0 as h \to 0. We can use dominated convergence since \lim_{h \to 0} e^{i\langle h,x \rangle} =1 \text{ for all } x \quad \quad \text{ and } \quad \quad | e^{i\langle h,x \rangle} - 1 | \le 2 \quad \square

As we have seen in Chapter 7 the L^p spaces (over a probability space) form a decreasing sequence L^1 \supset L^2 \supset L^3 \supset \cdots \supset L^\infty, and the next theorem shows that if a random variable belongs to L^m (for some integer m) then its characteristic function will be m-times continuously differentiable.

Theorem 10.2 Suppose X is a RV taking values in \mathbb{R}^n and such that E[|X|^m] < \infty. Then the characteristic function \phi_X(t)=E[e^{i\langle t, X \rangle}] has continuous partial derivatives up to order m, and for any k \le m, \frac{\partial^k\phi_X}{ \partial{t_{i_1}}\cdots \partial{t_{i_k}} }(t) = i^k E\left[X_{i_1} \cdots X_{i_k} e^{i\langle t, X \rangle}\right]

Proof. We will prove only the m=1 case, the rest is proved by a tedious induction argument. Denoting by e_i the i-th basis vector, \begin{aligned} \frac{\partial\phi_X}{ \partial t_i}(t) = \lim_{h \to 0} \frac{\phi_X(t+he_i)- \phi_X(t)}{h} = \lim_{h \to 0} E \left[ \frac{1}{h}\left(e^{i\langle t+ he_i, X \rangle} - e^{i\langle t, X \rangle}\right) \right] = \lim_{h \to 0} E \left[ e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h} \right] \end{aligned} To exchange the limit and expectation we use a DCT argument and the bound |e^{i\alpha} -1| = \left| \int_0^\alpha \frac{d}{ds} e^{is} \, ds \right| \le \int_0^\alpha |i e^{is}| \, ds \le |\alpha|\,. From this we see that \left|e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h}\right| \le |X_i| which is integrable and independent of h. The DCT concludes the proof. \quad \square

10.3 More properties

Two simple but extremely useful properties:

Theorem 10.3 If X takes value in \mathbb{R}^n, b \in \mathbb{R}^m and A is a m\times n matrix then \phi_{AX+b}(t) = e^{i \langle t, b\rangle}\phi_{X}(A^Tt)

Proof. This simply follows from the equality
e^{i \langle t, AX +b \rangle} = e^{i \langle t,b \rangle} e^{i \langle A^T t,X \rangle}\,. \quad \square


Theorem 10.4 Suppose X and Y are independent RV taking values in \mathbb{R}^n then \phi_{X+Y}(t)=\phi_{X}(t) \phi_Y(t)

Proof. By independence E\left[ e^{i \langle t,X+Y \rangle} \right] =E\left[ e^{i \langle t, X \rangle} e^{i \langle t, Y \rangle}\right] =E\left[ e^{i \langle t, X \rangle}\right]E\left[ e^{i \langle t,Y \rangle}\right] \quad \square

10.4 Examples

  1. Bernoulli with parameter p: \phi_X(t)=E[e^{itX}]= e^{it}p + e^{i0}(1-p)= (1-p) + e^{it}p\,.

  2. Binomial with parameters (n,p): using the binomial theorem \phi_X(t)=E[e^{itX}]= \sum_{k=0}^n {n \choose k} e^{itk}p^{k} (1-p)^{n-k} =(e^{it}p + (1-p))^n

  3. Poisson with parameter \lambda: \phi_X(t)=E[e^{itX}]= e^{-\lambda} \sum_{k=0}^\infty e^{itk} \frac{\lambda^k}{k!} = e^{\lambda(e^{it}-1)}

  4. Normal with parameters \mu, \sigma^2: We start with the special case \mu=0 and \sigma^2=1 and we need to compute the complex integral \phi_X(t)=E[e^{itX}]= \frac{1}{\sqrt{2\pi}}\int e^{itx} e^{-x^2/2} dx You can do it via a contour integral and the residue theorem (complete the square!). Instead we use an ODE argument.

First we note that by symmetry
\phi_X(t)= \frac{1}{\sqrt{2\pi}}\int \cos(tx) e^{-x^2/2} dx By Theorem 10.2 we can differentiate under the integral and find, after integrating by parts, \phi_X'(t)=\frac{1}{\sqrt{2\pi}}\int -x \sin(tx)e^{-x^2/2} dx = \frac{1}{\sqrt{2\pi}}\int -t \cos(tx)e^{-x^2/2} dx =-t \phi_X(t) a separable ODE with initial condition \phi_X(0)=1. The solution is easily found to be e^{-t^2/2} and thus we have \phi_X(t)=E\left[e^{itX}\right]= e^{-t^2/2}. Finally noting that Y=\sigma X +\mu is normal with mean \mu and variance \sigma^2 we find by Theorem 10.3 that \phi_Y(t)=e^{i\mu t}E\left[e^{i\sigma t X}\right]= e^{i\mu t -\sigma^2t^2/2}.

  5. Exponential with parameter \beta: For any (complex) z \neq 0 we have \int_a^b e^{z x} dx = \frac{e^{z b} - e^{za}}{z}. From this we deduce that \phi_X(t) = \beta \int_0^\infty e^{(it-\beta)x} dx = \frac{\beta}{\beta-it}
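
These closed forms are easy to check numerically. The following sketch (not part of the text; the values of t, \lambda, n, p are arbitrary) compares a truncated series for the Poisson characteristic function with e^{\lambda(e^{it}-1)}, and the binomial characteristic function with the n-th power of the Bernoulli one, as Theorem 10.4 predicts.

```python
import cmath
import math

t, lam = 0.7, 3.0

# Poisson cf by truncating the series; the tail beyond k = 80 is negligible
series = sum(cmath.exp(1j * t * k) * lam ** k / math.factorial(k)
             for k in range(80)) * math.exp(-lam)
closed = cmath.exp(lam * (cmath.exp(1j * t) - 1))

# Binomial cf computed directly, versus the n-th power of the Bernoulli cf
n, p = 12, 0.3
bern = (1 - p) + p * cmath.exp(1j * t)
binom = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * cmath.exp(1j * t * k)
            for k in range(n + 1))
```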

10.5 Uniqueness Theorem

We show now that the Fourier transform determines the probability measures uniquely, that is if two probability measures P and Q have the same Fourier transforms \widehat{P}=\widehat{Q} then they must coincide P=Q. For simplicity we only consider the 1-d case but the proof extends without problem.

There exist several versions of this proof (see your textbook for one such proof). We give here a direct proof which also gives an explicit formula on how to reconstruct the measure from its Fourier transform.

Our proof relies on the following computation of the so-called Dirichlet integral

Lemma 10.1 (Dirichlet integral) For T > 0 let S(T)= \int_0^T \frac{\sin{t}}{t} \, dt. We have then \lim_{T \to \infty} S(T) = \frac{\pi}{2}

This is a fun integral to do and can be done using a contour integral in the complex plane, or by a Laplace transform trick, or by the so-called Feynmann trick (add a parameter and differentiate). See for example the Wikipedia page.

We will take this result for granted and note that we have \int_0^T \frac{\sin(\theta t)}{t} dt = \rm{sgn}(\theta) S( |\theta| T) where {\rm sgn}(\theta) is +1, 0, or -1 according to whether \theta is positive, zero, or negative.

Theorem 10.5 (Fourier inversion formula) If a and b are not atoms for P we have P((a,b])= \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it} \widehat{P}(t) \, dt \tag{10.1} In particular distinct probability measures cannot have the same characteristic function.

Proof. The inversion formula implies uniqueness. The collection of intervals (a,b] such that a and b are not atoms is a \pi-system which generates \mathcal{B}, so the monotone class theorem implies the result, see Theorem 3.2. (See the exercises for more on atoms.)

Let I_T denote the integral in Equation 10.1. Using the bound |e^{iz}-e^{iz'}|\le |z-z'| we see that the integrand is bounded and thus by Fubini’s theorem we have I_T = \frac{1}{2\pi} \int_{-\infty}^\infty \left( \int_{-T}^T \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \right) dP(x) Using Euler’s formula and the fact that \cos is even and \sin is odd we find \begin{aligned} I_T &= \int_{-\infty}^\infty \left( \int_{0}^T \frac{ \sin(t(x-a)) -\sin(t(x-b)) }{\pi t} \, dt \right) dP(x) \\ &= \int_{-\infty}^\infty \frac{{\rm sgn}(x-a)}{\pi} S(T|x-a|)-\frac{{\rm sgn}(x-b)}{\pi} S(T|x-b|) dP(x) \end{aligned} \tag{10.2}

The integrand in Equation 10.2 is bounded and converges as T \to \infty to the function \psi_{a,b}(x)= \left\{ \begin{array}{cl} 0 & x < a \\ \frac{1}{2} & x=a \\ 1 & a < x < b \\ \frac{1}{2} & x=b \\ 0 & x >b \\ \end{array} \right. By DCT we have that I_T \to \int \psi_{a,b} dP = P((a,b]) if a and b are not atoms. \quad \square

You can use the Fourier inversion formula to extract more information.

Theorem 10.6 Suppose the Fourier transform \widehat{P}(t) is integrable, \int |\widehat{P}(t)| dt < \infty. Then P has a density f(x)=\frac{1}{2\pi}\int e^{-itx} \widehat{P}(t) dt.

Proof. Using that |\frac{e^{-ita} - e^{-itb}}{it}|\le |b-a|, the fact that |\widehat{P}(t)| is integrable means we can extend the integral in Equation 10.1 to an integral from -\infty to \infty. As a consequence we get P((a,b)) \le \frac{|b-a|}{2\pi} \int_{-\infty}^\infty |\widehat{P}(t)| \, dt and thus P has no atoms. Furthermore for the CDF F(x) of P we have, for h negative or positive, \frac{F(x+h) - F(x)}{h} =\frac{1}{2\pi}\int_{-\infty}^\infty \frac{e^{-itx} - e^{-it(x+h)}}{ith} \widehat{P}(t) dt The integrand is dominated by |\widehat{P}(t)|, and by DCT F is differentiable and P has density F'(x)= f(x)= \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx} \widehat{P}(t) dt. \quad \square

10.6 Examples

We can use the inversion theorem in creative ways.

Examples:

  1. The Laplace RV Y is a two-sided version of the exponential RV. Its density is f(x) = \frac{\beta}{2}e^{-\beta|x|}. You can think of the Laplace distribution as the mixture (with mixture parameters \frac12, \frac12) of an exponential RV X and -X where X is exponential. Its characteristic function is then \phi_Y(t)= E[e^{itY}]= \frac{1}{2}E[e^{itX}] + \frac{1}{2}E[e^{-itX}] = \frac{1}{2} \frac{\beta}{\beta -it} + \frac{1}{2}\frac{\beta}{\beta +it} =\frac{\beta^2}{\beta^2 + t^2}

  2. The Cauchy RV Z has density f(x)=\frac{\beta}{\pi(x^2 + \beta^2)} and its characteristic function is given by \phi_Z(t)= E[e^{itZ}]= \int_{-\infty}^\infty e^{itx}\frac{\beta}{\pi(x^2 + \beta^2)} dx, a priori not an easy integral. However notice that the Fourier transform of the Laplace looks (up to constants) exactly like the density of a Cauchy! So using Theorem 10.6 for the Laplace distribution shows that \frac{\beta}{2} e^{-\beta|x|} = \frac{1}{2\pi}\int e^{-itx} \phi_Y(t) dt = \frac{1}{2\pi}\int e^{-itx} \frac{\beta^2}{\beta^2 + t^2}dt = \frac{\beta}{2} \int e^{itx} \frac{\beta}{\pi(\beta^2 + t^2)}dt =\frac{\beta}{2} \phi_Z(x) from which we conclude that \phi_Z(t)= e^{-\beta|t|}.
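
As a numerical sanity check of example 1 (not part of the text; the grid size and the parameters \beta, t are arbitrary choices), one can approximate E[e^{itY}] for the Laplace density by a midpoint rule. By symmetry the characteristic function is real, so it suffices to integrate the cosine part.

```python
import math

beta, t = 1.5, 0.8
L, n = 40.0, 200_000       # truncation window [-L, L] and number of midpoints
h = 2 * L / n

# midpoint-rule approximation of E[cos(tY)] for the Laplace density
# (beta/2) * exp(-beta * |x|); the tail beyond |x| = 40 is negligible
integral = 0.0
for k in range(n):
    x = -L + (k + 0.5) * h
    integral += math.cos(t * x) * (beta / 2) * math.exp(-beta * abs(x)) * h

closed = beta ** 2 / (beta ** 2 + t ** 2)   # the closed form derived above
```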

10.7 Sum of independent random variables

Suppose X and Y are independent random variables. We wish to understand the distribution of X+Y. The first tool is the characteristic function: if X and Y are independent then E[ e^{it(X+Y)}] = E[e^{itX}] E[e^{itY}], which we combine with the uniqueness theorem.

Examples

  1. Suppose X is normal with parameters \mu and \sigma^2 and Y normal with parameters \nu and \eta^2. Then if X and Y are independent, X+Y is normal with parameters \mu+\nu and \sigma^2 + \eta^2. This follows from the uniqueness theorem and E[e^{it(X+Y)} ] = E[e^{itX}] E[e^{itY}] = e^{i\mu t - \sigma^2t^2/2} e^{i\nu t - \eta^2t^2/2} = e^{i(\mu+\nu) t - (\sigma^2+\eta^2) t^2/2}

  2. Suppose X_1, \cdots, X_n are independent Bernoulli RVs with parameter p; then X_1+ \cdots + X_n is a binomial RV. Indeed we have E[e^{it(X_1+\cdots+X_n)} ]= E[e^{itX_1}] \cdots E[e^{itX_n}] =(e^{it}p +(1-p))^n

Another tool is the following convolution theorem

Theorem 10.7 (Convolution of probability measures) Assume X and Y are independent random variables.

  • If X and Y have distribution P^X and P^Y then X+Y has the distribution P^X \star P^Y (A) = \int \int 1_A(x+y) dP^X(x) dP^Y(y) \quad \text{ convolution product}
  • If X and Y have densities f_X(x) and f_Y(y) then X+Y has the density f_{X+Y}(z)= \int f_X(z-y) f_Y(y) dy = \int f_X(x) f_Y(z-x) dx

Proof. For the first part let us take a non-negative function h and set Z=X+Y. We have then E[h(Z)]= E[h(X+Y)]= \int h(x+y) P^X(dx) P^Y(dy) Taking h=1_A gives the result.

For the second part if X and Y have a density we have \begin{aligned} P(Z \in A) =E[1_A(Z)] & = \int \int 1_A(x+y) f_X(x) f_Y(y) dx dy \\ & = \int \int 1_A(z) f_X(z-y) f_Y(y) dz dy \quad \text{ change of variables } z=x+y, dz = dx \\ &= \int \left(\int f_X(z-y) f_Y(y) dy\right) 1_A(z) dz \quad \text{ Fubini } \end{aligned} and since this holds for all A, Z has the claimed density. The second formula is proved in the same way. \quad \square.

Example: triangular distribution
Suppose X and Y are independent and uniformly distributed on [-\frac{1}{2},\frac12]. Then Z=X+Y takes values between -1 and +1, and for z \in [-1,1] we have f_Z(z)= \int_{-\infty}^\infty f_X(z-y) f_Y(y) dy = \left\{ \begin{array}{cl} \int_{-\frac12}^{z+ \frac12} dy & -1\le z \le 0 \\ \int_{z-\frac{1}{2}}^{\frac12} dy & 0 \le z \le 1 \end{array} \right\} = 1- |z|
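
The convolution formula can be checked on a grid. The sketch below (illustrative; the grid resolution and test points are arbitrary choices) approximates the convolution integral by a Riemann sum over y and compares it with the triangular density 1-|z|.

```python
n = 2000                      # grid points on [-1/2, 1/2)
h = 1.0 / n

def f_Z(z):
    # Riemann-sum approximation of int f_X(z - y) f_Y(y) dy
    # where f_X = f_Y = 1 on [-1/2, 1/2) and 0 elsewhere
    total = 0.0
    for i in range(n):
        y = -0.5 + (i + 0.5) * h   # midpoints where f_Y(y) = 1
        if -0.5 <= z - y < 0.5:    # f_X(z - y) = 1 exactly on this set
            total += h
    return total

# compare with the triangular density 1 - |z| on [-1, 1]
checks = [(z, f_Z(z), 1 - abs(z)) for z in (-0.75, -0.3, 0.0, 0.4, 0.9)]
```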

10.8 Moment and moment generating functions

The moment problem is the question whether a probability distribution P is uniquely determined by all its moments E[X^n]. This is in general not true, as the following example shows.

  • Recall that the log-normal distribution is the distribution of e^X where X is a normal random variable. Its pdf is given, for \mu=0 and \sigma^2=1, by f(x)= \frac{1}{x \sqrt{2 \pi}} e^{- \frac{\ln(x)^2}{2}} and all moments exist: \int_0^\infty x^r f(x) dx = e^{r^2/2}.

  • Now consider g(x) = f(x)(1 + \sin(2 \pi \ln(x))), which is non-negative since 1+\sin(\cdot)\ge 0. Then for k=0,1,2, \cdots we have, with the change of variable \ln(x)= s+k, \int_0^\infty x^k f(x) \sin(2 \pi \ln(x)) dx = \frac{1}{\sqrt{2 \pi}} e^{k^2/2}\int_{-\infty}^\infty e^{-s^2/2} \sin(2 \pi s) ds =0 \,. This shows that g is the density of a RV Y and that all moments of Y coincide with the moments of the log-normal!
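
This moment computation can be verified numerically by integrating in the variable u=\ln(x), where E[X^k]=\frac{1}{\sqrt{2\pi}}\int e^{ku-u^2/2}(1+\sin(2\pi u))\,du for the perturbed density. The sketch below (not part of the text; the integration range and grid size are ad-hoc choices) computes the first few moments of both f and g and compares them with e^{k^2/2}.

```python
import math

def moment(k, perturbed, lo=-12.0, hi=14.0, n=130_000):
    # E[X^k] computed in the variable u = ln(x) by a midpoint rule:
    #   (2 pi)^{-1/2} int e^{k u - u^2/2} (1 + sin(2 pi u), if perturbed) du
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * h
        w = math.exp(k * u - u * u / 2)
        if perturbed:
            w *= 1 + math.sin(2 * math.pi * u)
        total += w * h
    return total / math.sqrt(2 * math.pi)

moments_f = [moment(k, False) for k in range(4)]   # log-normal moments
moments_g = [moment(k, True) for k in range(4)]    # moments of the density g
```

Both lists agree with e^{k^2/2} to within the quadrature error, even though f and g are different densities.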

A stronger condition on the moments does imply uniqueness: if all moments exist and E[X^n] does not grow too fast with n then the moments do determine the distribution. This uses an analytic continuation argument and relies on the uniqueness theorem for the Fourier transform.

Theorem 10.8 Suppose X and Y are RVs whose moment generating functions coincide, M_X(t)=M_Y(t) for t \in [-t_0,t_0], and are finite in that interval. Then X and Y have the same distribution.

Proof.

  • Since e^{t|x|} \le e^{tx}+ e^{-tx} and the right hand side is integrable, the function e^{t|X|}=\sum_{k=0}^\infty \frac{t^k|X|^k}{k!} is integrable for 0 \le t \le t_0. By the DCT (for sums of RVs, in the form of Exercise 7.2) we have E[e^{tX}]=\sum_{k=0}^\infty \frac{t^k E[X^k]}{k!}.

  • This implies that \frac{t^k E[X^k]}{k!} \to 0 as k \to \infty (for |t|\le t_0). We claim that this implies that \frac{s^k E[|X|^k]}{k!}\to 0 as k \to \infty as long as s < t_0. If k is even E[|X|^k]=E[X^k] and there is nothing to do. For k odd, say k=2m-1, we use on one hand that |X|^{2m-1} \le 1 + |X|^{2m}, and on the other hand that for s< t \le t_0 we have (for m sufficiently large) 2m\, s^{2m-1} < t^{2m}. Together this shows \frac{s^{2m-1} E[|X|^{2m-1}]}{(2m-1)!} \le \frac{s^{2m-1}}{(2m-1)!} + \frac{t^{2m} E[X^{2m}]}{(2m)!} \quad \text{ for $m$ large enough.}

  • The next piece is the Taylor expansion theorem with remainder for functions which are (n+1)-times continuously differentiable f(x)= \sum_{k=0}^n f^{(k)}(x_0) \frac{(x-x_0)^k}{k!} + \int_{x_0}^x \frac{f^{(n+1)}(s)}{n!}(x-s)^n ds Applying this to the function h \mapsto e^{ihx} we obtain
    \left|e^{itx}\left( e^{ihx} - \sum_{k=0}^n \frac{(ihx)^k}{k!}\right)\right| \le \frac{|h x|^{n+1}}{(n+1)!}

  • Integrating with respect to P and taking n \to \infty together with Theorem 10.2 gives \phi_X(t+h)= \sum_{k=0}^\infty \phi_X^{(k)}(t) \frac{h^k}{k!} \tag{10.3} for |h| < t_0.

  • Now suppose X and Y have the same moment generating function in a neighborhood of 0. Then all their moments coincide and thus by Equation 10.3 (with t=0), \phi_X(t)=\phi_Y(t) on the interval (-t_0, t_0). By Theorem 10.2 their derivatives must also be equal on (-t_0,t_0). Using now Equation 10.3 (with t=-t_0+\epsilon and t=t_0-\epsilon) shows that \phi_X(t)=\phi_Y(t) on the interval (-2t_0, 2t_0). Repeating this argument \phi_X(t)=\phi_Y(t) for all t and thus by Theorem 10.5 X and Y must have the same distribution. \quad \square

10.9 Exercises

Exercise 10.1  

  • Show that a characteristic function \phi_X(t) satisfies \overline{\phi_X(t)}=\phi_X(-t) (complex conjugate).

  • Show that a characteristic function \phi_X(t) is real if and only if the random variable X is symmetric (i.e X and -X have the same distribution)

  • Show that if \phi_X is the characteristic function for some RV X then \phi_X^2(t) and |\phi_X(t)|^2 are characteristic functions as well. What are the corresponding RVs?

Exercise 10.2 In this problem we study the characteristic function for a Gamma random variable with parameter \alpha and \beta and density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}. In particular you will prove that \phi_X(t)=\left(\frac{\beta}{\beta-it}\right)^\alpha.

  • First show that it is enough to consider the case \beta=1 (change scale.)

  • Use the moments E[X^n] to show that \phi_X(t)= \sum_{n=0}^\infty (it)^n \frac{\Gamma(\alpha+n)}{\Gamma(\alpha) n!} and use then the binomial series for (1-it)^{-\alpha}.

  • Use your result to show that

    • The sum of two independent gamma random variables with parameters (\alpha_1,\beta) and (\alpha_2,\beta) is a gamma random variable.

    • If X_1, X_2, \cdots, X_n are independent normal random variables with mean 0 and variance \sigma^2, show that X_1^2 + \cdots + X_n^2 is a gamma random variable and find the parameters.

Exercise 10.3 Show that if X and Y are RVs taking values in the non-negative integers with distributions P^X(n) and P^Y(n) and are independent, then X+Y has distribution P^{X+Y}(n)=\sum_{k=0}^n P^X(k) P^Y(n-k) (this is called the convolution product of the two sequences P^X and P^Y).

Exercise 10.4  

  • In Theorem 10.5, modify the statement of the theorem if a or b are atoms.

  • Show that P(\{a\}) = \lim_{T \to \infty} \frac{1}{2T}\int_{-T}^T e^{-ita} \widehat{P}(t) dt Hint: Imitate the proof of the inversion formula in Theorem 10.5

  • Suppose a RV X takes integer values in \mathbb{Z}, show that P(X=n)= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-itn} \phi_X(t) dt Hint: Show that \phi_X(t) is periodic

11 Convergence of random variables

We think of a (real-valued) random variable as a function X(\omega) and so if we have a sequence of RVs \{ X_n\} we can define various types of convergence.

  • Convergence almost sure

  • Convergence in probability

  • Convergence in L^p

In subsequent chapters we will study another type of convergence, namely weak convergence (also called convergence in distribution). It is of a quite different type because it is based on the distribution of X and not on the notion of a random variable as a function.

11.1 Almost sure convergence

Definition 11.1 (Almost sure convergence) A sequence of RVs \{X_n\} converges almost surely to a RV X if \lim_{n} X_n(\omega) = X(\omega) \text{ a.s} that is P(\{\omega\,:\, \lim_{n} X_n(\omega) = X(\omega)\})=1.

Almost sure convergence is pointwise convergence; the limit is unique if, of course, we identify RVs which are equal a.s.

It will be very useful to rephrase almost sure convergence in a different way. At first sight it looks a bit strange but it is a good idea. We explain the idea first for a sequence \{x_n\} of numbers. Consider the function defined for any \epsilon>0 i_\epsilon(x) = 1_{(\epsilon,\infty)}(x) = \left\{ \begin{array}{cl} 1 & x > \epsilon \\ 0 & x \le \epsilon \end{array} \right.

Lemma 11.1  

  • A sequence \{x_n\} converges to x if and only if \sum_n i_\epsilon(|x_n-x|) < \infty for every \epsilon>0.

  • If there is a non-negative sequence \epsilon_n such that \sum_{n}\epsilon_n <\infty and \sum_{n} i_{\epsilon_n}(|x_n-x_{n+1}|) < \infty then x_n converges to some x.

Proof.

  • Fix \epsilon>0. If x_n \to x then there exists N such that for any n \ge N, |x_n-x|\le \epsilon which means i_\epsilon(|x_n-x|) =0 for n \ge N and thus \sum_n i_\epsilon(|x_n-x|) < \infty. Conversely if \sum_n i_\epsilon(|x_n-x|) < \infty then, since the terms are either 0 or 1, only finitely many terms can be nonzero and thus there exists N such that |x_n-x| \le \epsilon for n \ge N. \quad \square

  • If this holds there exists N such that |x_n-x_{n+1}|\le \epsilon_n for n \ge N. Taking n>m \ge N gives |x_n-x_m| \le |x_n-x_{n-1}|+ \cdots + |x_{m+1}-x_m| \le \epsilon_m + \cdots + \epsilon_n \le \sum_{j=m}^\infty \epsilon_j Since the sequence \epsilon_n is summable, \sum_{j=m}^\infty \epsilon_j goes to 0 as m \to \infty. Therefore the sequence x_n is a Cauchy sequence and thus x_n \to x for some x. \quad \square

Returning to random variables we find

Theorem 11.1 The sequence of RVs X_n converges almost surely to X if and only if, for every \epsilon>0, \sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \quad \text{ almost surely} \tag{11.1}

Proof.

  • Let \Omega_0 be the set on which convergence holds; then the sum in Equation 11.1 converges for \omega \in \Omega_0 (which is independent of \epsilon).

  • For the converse the only small issue to deal with is that the set on which the sum converges may depend on \epsilon. So pick a sequence \epsilon_k \searrow 0 and let N_k = \sum_{n} i_{\epsilon_k}\circ|X_n-X| be the corresponding sum in Equation 11.1; we have P(N_k < \infty)=1. Since \epsilon_{k+1}\le \epsilon_k, i_{\epsilon_{k+1}}\ge i_{\epsilon_{k}} and so N_{k+1}\ge N_k. The events \{N_k < \infty\} are shrinking to \Omega_0=\{ \omega\,:\, \sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \text{ for all } \epsilon >0 \} By sequential continuity P(\Omega_0)=\lim_{k} P(N_k < \infty)=1 and thus X_n converges to X almost surely. \quad \square

11.2 Borel-Cantelli Lemmas

The Borel-Cantelli Lemma is very simple and very useful.

Lemma 11.2 (Borel Cantelli Lemma) Suppose B_n is a collection of events. Then \sum_n P(B_n) < \infty \implies \sum_{n} 1_{B_n} < \infty \text{ almost surely}\,.

Proof. By the monotone convergence theorem (applied to sums) E[\sum_{n} 1_{B_n}]=\sum_{n} E[1_{B_n}] =\sum_n P(B_n) < \infty. This implies that \sum_{n} 1_{B_n}< \infty almost surely. \quad \square

There is a converse to the Borel-Cantelli Lemma. To a set B we can associate a Bernoulli RV X=1_{B} and vice versa. The Kolmogorov 0-1 law tells us that if the B_n (and so the X_n) are independent then either \sum_n {X_n}<\infty or \sum_n {X_n}=\infty almost surely. The Borel-Cantelli Lemma shows that (even without any independence assumption) the first case occurs if \sum_n P(B_n) < \infty. The next result gives a partial converse.

Lemma 11.3 (Borel Cantelli Lemma) Suppose X_n are Bernoulli random variables.

  • If \sum_n E[X_n] < \infty then \sum_{n}X_n <\infty almost surely.

  • If \sum_n E[X_n] = \infty and the X_n are pairwise independent then \sum_{n} X_n =\infty almost surely.

Proof. The first part is just the Borel-Cantelli Lemma (Lemma 11.2). The proof of the second part is more subtle. We let S_n=X_1+ \cdots + X_n and S=\lim_{n}S_n. Using the pairwise independence we have {\rm Var}(S_n) = \sum_{i=1}^n {\rm Var}(X_i) = \sum_{i=1}^n p_i (1-p_i) \le \sum_{i=1}^n p_i where p_i=E[X_i].

We set further a_n = \sum_{i=1}^n p_i=\sum_{i=1}^n E[X_i]. Then by assumption a_n \nearrow \infty. This implies that for any b>0 we also have a_n - \sqrt{b a_n} \nearrow \infty and so \{ S \le a_n - \sqrt{b a_n} \} \nearrow \{S < \infty\}.

Since S_n \le S we have \{ S \le a_n - \sqrt{b a_n} \} \subset \{ S_n \le a_n - \sqrt{b a_n} \} \subset \{ |S_n -a_n| \ge\sqrt{b a_n} \} Hence, by sequential continuity and the Chebyshev inequality, \begin{aligned} P( S < \infty) & = \lim_{n} P( S \le a_n - \sqrt{b a_n}) \le \limsup P(S_n \le a_n - \sqrt{b a_n}) \le \limsup P(|S_n -a_n| \ge\sqrt{b a_n}) \\ &\le \limsup_{n} \frac{{\rm Var}(S_n)}{b a_n} \le \frac{1}{b} \end{aligned} Since b is arbitrary, S=\infty almost surely. \quad \square
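
A finite simulation cannot prove the dichotomy, but it illustrates the two regimes of the lemma. The sketch below (not part of the text; the horizon N and the choices P(B_n)=1/n and P(B_n)=1/n^2 are arbitrary) simulates independent events and counts how many occur: in the summable case only a handful of events occur (mostly early ones), while in the divergent case the count keeps growing, roughly like \log N.

```python
import random

random.seed(1)

def count_occurrences(probs):
    # simulate independent events B_n with P(B_n) = probs[n-1]
    # and count how many of them occur
    return sum(random.random() < p for p in probs)

N = 200_000
hits_div = count_occurrences([1 / n for n in range(1, N + 1)])        # sum P(B_n) diverges
hits_conv = count_occurrences([1 / n ** 2 for n in range(1, N + 1)])  # sum P(B_n) converges
```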

From the Borel-Cantelli Lemma we obtain useful criteria for almost sure convergence.

Theorem 11.2  

  1. If, for any \epsilon> 0, \sum_{n} P(|X_n-X|\ge \epsilon) < \infty then X_n \to X almost surely.

  2. If there exists a sequence \epsilon_n \searrow 0 such that \sum_{n} P(|X_n-X|\ge \epsilon_n) < \infty, then X_n \to X almost surely.

  3. If there exists a sequence \epsilon_n with \sum_{n}\epsilon_n <\infty and \sum_{n} P(|X_n-X_{n+1}|\ge \epsilon_n) < \infty, then X_n \to X almost surely.

Proof.

  1. By Borel-Cantelli we have \sum_{n} i_\epsilon(|X_n-X|) < \infty a.s. which means a.s convergence.

  2. By the Borel-Cantelli Lemma we have |X_n-X| \le \epsilon_n for all but finitely many n, almost surely. Since \epsilon_n \to 0 this means that X_n converges to X a.s.

  3. By Lemma 11.1 and Borel-Cantelli \{X_n\} is a Cauchy sequence a.s. and thus converges a.s.

11.3 Convergence in Probability

Definition 11.2 (Convergence in probability) A sequence of RVs \{X_n\} converges in probability to a RV X if, for any \epsilon> 0, \lim_{n \to \infty} P( \{\omega\,:\, |X_n(\omega) - X(\omega)|> \epsilon\}) =0 \,.

This is a very useful mode of convergence, in particular because it is weaker than almost sure convergence and thus easier to establish.

Example Let \Omega=[0,1) and P_0 be Lebesgue measure. Consider the sequence of RVs X_1 = 1_{[0,\frac12)}, X_2 = 1_{[\frac12,1)}, X_3 = 1_{[0,\frac13)}, X_4 = 1_{[\frac13,\frac23)}, X_5 = 1_{[\frac23,1)}, X_6 = 1_{[0,\frac14)}, X_7 = 1_{[\frac14,\frac24)}, \cdots \tag{11.2}

  • We claim that X_n converges to 0 in probability. Indeed, for any \epsilon>0, P( X_n > \epsilon ) =P(X_n=1) \to 0 since X_n=1_{I_n} is the indicator function of an interval I_n whose length goes to 0 as n goes to infinity.

  • The sequence X_n does not converge a.s. Indeed every \omega \in [0,1) belongs to infinitely many of the intervals [\frac{k}{m},\frac{k+1}{m}) appearing in the sequence and also fails to belong to infinitely many of them. Therefore \liminf_n X_n(\omega)=0 \not= \limsup_n X_n(\omega)= 1.

  • What we can also see is that the sequence X_n has (many!) subsequences which converge to 0 almost surely. To do this, pick n_k such that the interval I_{n_k} for X_{n_k}=1_{I_{n_k}} is contained in the interval I_{n_{k-1}} for X_{n_{k-1}}= 1_{I_{n_{k-1}}}.
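
The "typewriter" structure of the sequence in Equation 11.2 can be made concrete in code. The sketch below (illustrative; `typewriter_intervals` is a hypothetical helper, and the cutoff of 200 blocks is arbitrary) builds the intervals block by block with exact arithmetic and confirms that a fixed \omega lands in exactly one interval per block (so X_n(\omega)=1 infinitely often) while the interval lengths, i.e. P(X_n=1), tend to 0.

```python
from fractions import Fraction

def typewriter_intervals(max_m):
    # block m = 2, 3, ..., max_m consists of the m intervals [k/m, (k+1)/m)
    out = []
    for m in range(2, max_m + 1):
        for k in range(m):
            out.append((Fraction(k, m), Fraction(k + 1, m)))
    return out

seq = typewriter_intervals(200)
omega = Fraction(3, 10)

# X_n(omega) = 1 exactly once per block (each block partitions [0,1)),
# so infinitely often in the limit, while P(X_n = 1) = interval length -> 0
hits = sum(1 for a, b in seq if a <= omega < b)      # one hit per block: 199
last_length = seq[-1][1] - seq[-1][0]                # length of the last interval
```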

The relation between almost sure convergence and convergence in probability is contained in the following theorem.

The third part, while looking somewhat convoluted, is very useful to streamline subsequent proofs. It relies on the following simple fact: if a sequence x_n is such that every subsequence has a sub-subsequence which converges to x, then x_n converges to x.

Theorem 11.3 (Almost sure convergence versus convergence in probability)  

  1. If X_n converges almost surely to X then X_n converges in probability to X.

  2. If X_n converges in probability to X then there exists a subsequence X_{n_k} which converges to X almost surely.

  3. If every subsequence of X_n has a further sub-subsequence which converges to X almost surely, then X_n converges to X in probability.

Proof.

  1. If X_n converges to X almost surely then, for every \epsilon>0, i_\epsilon(|X_n-X|) converges to 0 almost surely. By the bounded convergence theorem this implies E[i_\epsilon(|X_n-X|)]=P(|X_n-X|> \epsilon) converges to 0.

  2. If X_n converges to X in probability then let us pick \epsilon_k =\frac{1}{k} \searrow 0. Since P(|X_n(\omega) - X(\omega)|> \epsilon_k) \to 0 as n \to \infty we can find n_k such that P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \frac{1}{2^k}. Therefore \sum_{k=1}^\infty P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \sum_{k} \frac{1}{2^k} < \infty \,. By part 2. of Theorem 11.2 this shows that X_{n_k} converges almost surely to X.

  3. Assume that every subsequence of X_n has a sub-subsequence which converges to X almost surely. Fix \epsilon>0 and consider the numerical sequence p_n(\epsilon)=P(|X_n-X| \ge \epsilon). Since this sequence is bounded, by the Bolzano-Weierstrass theorem we can extract a convergent subsequence p_{n_k} with \lim_{k} p_{n_k} =p. By assumption X_{n_k} has a sub-subsequence X_{n_{k_j}} which converges to X almost surely. This implies, by part 1., that p_{n_{k_j}} converges to 0, and hence p=0. So every convergent subsequence of the bounded sequence p_n converges to 0. This implies that p_n converges to 0 and thus X_n converges to X in probability. \quad \square.

Based on this we obtain the following continuity theorem.

Theorem 11.4 (Continuity theorem for convergence in probability)  

  1. If X_n converges to X almost surely and f is a continuous function then f(X_n) converges to f(X) almost surely.

  2. If X_n converges to X in probability and f is a continuous function then f(X_n) converges to f(X) in probability.

Proof. Part 1. is just the definition of continuity.
For part 2. suppose X_n converges to X in probability. Then, by Theorem 11.3, there exists a subsequence X_{n_k} which converges to X almost surely which implies, by part 1., that f(X_{n_k}) converges to f(X) almost surely.

Now we apply part 3. of Theorem 11.3 to the sequence Y_n= f(X_n). By the argument of the previous paragraph, applied to an arbitrary subsequence of X_n, every subsequence of f(X_n) has a sub-subsequence which converges to f(X) a.s., and thus f(X_n) converges to f(X) in probability.

Using a similar argument we show that convergence in probability is preserved under arithmetic operations.

Theorem 11.5 Suppose X_n converges to X in probability and Y_n converges to Y in probability. Then X_n+Y_n converges to X+Y in probability, X_n-Y_n converges to X-Y in probability, X_n Y_n converges to XY in probability and X_n/Y_n converges to X/Y in probability (assuming that Y_n and Y are almost surely non-zero).

Proof. All the proofs are the same so let us do the sum. Consider an arbitrary subsequence of X_n+Y_n. Along it we can pick a further subsequence such that X_n converges to X almost surely, and then yet another subsequence along which Y_n also converges to Y almost surely. Along this final subsequence X_n+Y_n converges to X+Y almost surely. So every subsequence of X_n+Y_n has a sub-subsequence which converges to X+Y almost surely, and thus by part 3. of Theorem 11.3, X_n+Y_n converges to X+Y in probability. \quad \square.


We finish this section by showing that convergence in probability turns the space of RVs into a complete metric space. We will use the following metric d(X,Y) = E[ \min\{ |X-Y|, 1\} ]. The choice is not unique; often one finds instead d(X,Y)=E\left[ \frac{|X-Y|}{1 + |X-Y|}\right].

Theorem 11.6 (Convergence in probability and metric)  

  • X_n converges to X in probability if and only if \lim_{n \to \infty}d(X_n,X) =0.

  • The space of measurable RVs,
    L^0(\Omega, \mathcal{A}, P)=\{ X :(\Omega, \mathcal{A}, P) \to (\mathbb{R},\mathcal{B}) \text{ measurable} \} with the metric d is a complete metric space for the metric d(X,Y) (if we identify RVs which are a.s. equal).
    Equivalently, for any Cauchy sequence \{X_n\} for convergence in probability there exists a random variable Y such that X_n converges to Y in probability.

Proof. It is easy to check that d(X,Y) is a distance.

We also have for \epsilon \in (0,1) and x \ge 0 \epsilon\, i_\epsilon(x) \le \min\{x,1\} \le \epsilon + i_\epsilon(x) Replacing x by |X_n-X| and taking expectations shows that \epsilon P(|X_n-X| \ge \epsilon) \le d(X_n,X) \le \epsilon + P(|X_n-X| \ge \epsilon) and this proves the first claim.

Finally assume that \{X_n\} is a Cauchy sequence for the metric d. Then \lim_{n,m} d(X_n,X_m) =0 or, as we have just seen, equivalently for any \epsilon >0 we can find N such that P(|X_n-X_m| \ge \epsilon) \le \epsilon for n,m\ge N. Choose now \epsilon_k = \frac{1}{2^k} and find corresponding N_k \le N_{k+1} \le \cdots. Setting Y_k=X_{N_k}, this implies that \sum_{k} P(|Y_k-Y_{k+1}|>\epsilon_k) < \infty and thus, by the Borel-Cantelli Lemma and a telescoping argument, Y_k converges almost surely to a RV Y.

We show now that X_n converges to Y in probability. Since |X_n-Y| \le |X_n - X_{N_k}| + |Y_k -Y| we have i_\epsilon(|X_n-Y|) \le i_{\epsilon/2}(|X_n-X_{N_k}|) + i_{\epsilon/2}(|Y_k-Y|) \,. Taking expectations gives P(|X_n-Y| > \epsilon) \le P(|X_n-X_{N_k}| > \epsilon/2) + P(|Y_k-Y|> \epsilon/2). The first term goes to 0 as n and N_k go to \infty since the sequence X_n is Cauchy, and the second term goes to 0 as k \to \infty since we have almost sure convergence. Therefore X_n converges to Y in probability. \quad \square.

11.4 Convergence in L^p.

Convergence in L^p simply uses the norm \|X\|_p. Most of the time we use p=1 or p=2.

Definition 11.3 (Convergence in L^p) A sequence of RVs \{X_n\} converges in L^p to a RV X if \lim_{n \to \infty} E[|X_n-X|^p] =0 or equivalently \lim_{n \to \infty}\|X_n-X\|_p=0.

Remarks:

  • The limit of a sequence in L^p is unique since \|X-Y\|_p\le \|X-X_n\|_p + \|Y-X_n\|_p by Minkowski's inequality.

  • Note that if X_n converges to X in L^1 then we have convergence of the first moments. Indeed we have |E[X_n]-E[X]| \le E[|X_n-X|] \quad \text{and} \quad |E[|X_n|] - E[|X|]| \le E[|X_n-X|] (the second follows from the reverse triangle inequality ||x|-|y||\le |x-y|). Therefore E[X_n]\to E[X] and E[|X_n|]\to E[|X|].

  • If X_n converges in L^1 then X_n does not need to converge almost surely. See the sequence Equation 11.2 which also converges to 0 in L^1.

  • If X_n converges in L^p then X_n converges in probability as well. We have P(|X-X_n| \ge \epsilon) \le E[|X_n-X|^p]/{\epsilon^p} by Markov inequality. In particular, by Theorem 11.3 if X_n converges to X in L^p then there exists a subsequence X_{n_k} which converges almost surely to X.

  • Conversely convergence in probability does not imply convergence in L^1. Modify the sequence in Equation 11.2 to make it Y_1=X_1, Y_2=2X_2, Y_3=2X_3, Y_4=3X_4, \cdots. This sequence converges in probability to 0 as well! The scaling ensures that E[Y_n]=1 so Y_n does not converge to 0 in L^1. Note also that for any n there are infinitely many m \ge n with E[|Y_n-Y_m|]=2 (disjoint intervals), so the sequence is not even Cauchy in L^1.
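A quick computation makes the counterexample concrete. Here `block(n)` (our own helper name) returns the index m of the block containing the n-th interval of Equation 11.2, so the interval has length 1/m and Y_n = m\,1_{I_n}:

```python
# Scaled typewriter sequence: Y_n = m * 1_{I_n} where I_n is the n-th interval,
# of length 1/m, in the enumeration of Equation 11.2.  Then E[Y_n] = 1 for all
# n, while P(Y_n > 0) = 1/m -> 0: convergence in probability but not in L^1.
def block(n):
    """Return m such that the n-th interval (1-indexed) has length 1/m."""
    m, start = 1, 0
    while start + m < n:
        start += m
        m += 1
    return m

for n in (1, 10, 100, 1000):
    m = block(n)
    print(f"n={n:5d}: E[Y_n] = {m * (1 / m):.1f}, P(Y_n > 0) = {1 / m:.4f}")
```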

We prove now a converse which looks a bit like dominated convergence theorem.

Theorem 11.7 (Convergence in L^p versus convergence in probability)  

  1. If X_n converges to X in L^p then X_n converges to X in probability.

  2. If X_n converges to X in probability and |X_n|\le Y for some Y \in L^p then X_n converges to X in L^p.

Proof. We already discussed 1. (Markov inequality). For the converse, since X_n converges in probability to X there exists a subsequence X_{n_k} which converges almost surely to X. Since |X_{n_k}|\le Y we see that |X| \le Y and thus X \in L^p.

The sequence a_n =E[|X-X_n|^p] is bounded since, by Minkowski, a_n^{1/p}=\|X-X_n\|_p \le \|X\|_p +\|X_n\|_p \le 2\|Y\|_p. Let a_{n_k} be a convergent subsequence. Since X_{n_k} converges to X in probability there exists a sub-subsequence X_{n_{k_j}} which converges to X a.s. Then |X_{n_{k_j}}-X|^p converges almost surely to 0 and |X_{n_{k_j}}-X|^p \le 2^p |Y|^p which is integrable. So by DCT a_{n_{k_j}} converges to 0, and hence so does a_{n_k}. Since every convergent subsequence of the bounded sequence a_n converges to 0, a_n converges to 0. \quad \square.

11.5 Exercises

Exercise 11.1 Show (by a counterexample) that if f is not continuous, convergence of X_n in probability to X does not imply convergence of f(X_n) to f(X) in probability.

Exercise 11.2 Show that if X_n is any sequence of random variables taking value in \mathbb{R} then we can find c_n such that \frac{X_n}{c_n} \to 0 almost surely.
Hint: Use Borel Cantelli Lemma

Exercise 11.3 Let X_1, X_2, … be independent Bernoulli random variables with P(X_n = 1) = p_n and P(X_n = 0) = 1−p_n.

  • Show that X_n converges to 0 in probability if and only if \lim_n p_n=0.

  • Show that X_n converges to 0 a.s. if and only if \sum_n p_n < ∞.

Hint: Use the Borel Cantelli Lemmas

12 The law of large numbers

The law of large numbers is a foundational (and also very intuitive) concept for probability theory. Suppose we are interested in finding the probability P(A) for some event A which is the outcome of some random experiment. To do this, repeat the experiment n times for sufficiently large n; an approximate value for P(A) is the proportion of experiments whose outcome belongs to A:

P(A) \approx \frac{ \text{ number of times the experiment belongs to } A }{n} \quad \text{ if $n$ is large enough}

This is the basis for the frequentist approach to probability: probabilities of events are obtained by repeating the random experiment.

12.1 Strong law of large numbers

Consider a probability space (\Omega, \mathcal{A},P) on which real-valued random variables X_1, X_2, \cdots are defined. We then define the sum S_n = X_1 + \cdots + X_n. The law of large numbers refers to the convergence of the average \frac{S_n}{n} to a limit. The convergence can be in probability, in which case we speak of a weak law of large numbers, or almost sure, in which case we speak of a strong law of large numbers.

Example: Normal Consider X_1, X_2, \cdots, X_n independent normal RVs with mean 0 and variance \sigma^2. Then using the moment generating function we have E\left[e^{t\frac{S_n}{n}}\right] = E\left[e^{\frac{t}{n}X_1} \cdots e^{\frac{t}{n}X_n} \right] = E\left[e^{\frac{t}{n}X_1}\right] \cdots E\left[e^{\frac{t}{n}X_n} \right] = \left(e^{\frac{\sigma^2}{2} \frac{t^2}{n^2}} \right)^n = e^{\frac{\sigma^2}{2} \frac{t^2}{n}} By Theorem 10.8 we conclude that \frac{S_n}{n} is a normal random variable with variance \frac{\sigma^2}{n}.

By Chebyshev we conclude that P\left( \left|\frac{S_n}{n}\right| \ge \epsilon\right) \le \frac{\sigma^2}{n \epsilon^2} and so \frac{S_n}{n} converges to 0 in probability. This is not enough to show almost sure convergence since \sum_{n}\frac{\sigma^2}{n \epsilon^2} =\infty.

By Chernoff however we have (see the example after Theorem 7.7) P\left( \left|\frac{S_n}{n}\right| \ge \epsilon \right) \le e^{-n \epsilon^2/2\sigma^2} and since \sum_{n}e^{-n \epsilon^2/2\sigma^2} < \infty the Borel-Cantelli Lemma (see Theorem 11.2) implies that \frac{S_n}{n} converges to 0 a.s.
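A minimal simulation of this Gaussian example (\sigma=2 is an arbitrary choice of ours) shows the running average S_n/n concentrating around 0 at rate \sigma/\sqrt{n}:

```python
import random

# Running average S_n/n for i.i.d. N(0, sigma^2) samples (sigma = 2 is an
# arbitrary choice): the fluctuations shrink like sigma / sqrt(n).
random.seed(1)
sigma = 2.0
S = 0.0
for n in range(1, 100001):
    S += random.gauss(0.0, sigma)
    if n in (10, 100, 1000, 10000, 100000):
        print(f"n={n:6d}  S_n/n = {S / n:+.4f}")
```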

Example: Cauchy Consider X_1, X_2, \cdots, X_n independent Cauchy RVs with parameter \beta. Using the characteristic function (recall the characteristic function of a Cauchy RV is e^{-\beta|t|}) we find that E\left[e^{it\frac{S_n}{n}}\right]=E\left[e^{\frac{it}{n}X_1} \cdots e^{\frac{it}{n}X_n} \right] = E\left[e^{\frac{it}{n}X_1}\right] \cdots E\left[e^{\frac{it}{n}X_n} \right] = \left( e^{-\beta\left|\frac{t}{n}\right|} \right)^n = e^{-\beta|t|} that is, \frac{S_n}{n} is a Cauchy RV with the same parameter as X_i. Neither almost sure convergence nor convergence in probability is reasonable to expect here; convergence in distribution will be the useful notion.
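The contrast with the Gaussian case can be seen in a simulation, sampling Cauchy variables (with \beta=1, our choice) via the inverse CDF X=\tan(\pi(U-\tfrac12)):

```python
import math, random

# Cauchy(beta = 1) samples via the inverse CDF X = tan(pi (U - 1/2)).
# Since S_n/n is again Cauchy with the same parameter, the running average
# keeps fluctuating on the same scale for every n, unlike the Gaussian case.
random.seed(2)
S = 0.0
for n in range(1, 100001):
    S += math.tan(math.pi * (random.random() - 0.5))
    if n in (10, 100, 1000, 10000, 100000):
        print(f"n={n:6d}  S_n/n = {S / n:+.3f}")
```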


We start with a (not optimal) version of the LLN

Theorem 12.1 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i] and variance \sigma^2=\rm{Var}(X_i)< \infty. Then we have \lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu \quad \left\{ \begin{array}{l} \text{almost surely}\\ \text{in probability} \\ \text{ in } L^2 \end{array} \right.

Proof. We show first that \frac{S_n}{n} converges to \mu in L^2. To simplify the computation we write Y_i=X_i-\mu so that E[Y_i]=0 and E[Y_i^2]={\rm Var}(Y_i)={\rm Var}(X_i)=\sigma^2. We have then \frac{S_n}{n}- \mu = \frac{1}{n}(Y_1+ \cdots + Y_n) and by independence E[Y_i Y_j]=0 for i \not= j. Using all this E\left[\left( \frac{S_n}{n}- \mu\right)^2\right] = E\left[ \left( \frac{1}{n} \sum_{j=1}^n Y_j\right)^2\right] = \frac{1}{n^2}E\left[ \sum_{i,j=1}^n Y_i Y_j\right] = \frac{1}{n^2}E\left[ \sum_{i=1}^n Y_i^2 \right] + \underbrace{\frac{1}{n^2}E\left[ \sum_{i \not= j} Y_i Y_j \right]}_{=0} = \frac{\sigma^2}{n} From this we conclude that \frac{S_n}{n} converges to \mu in L^2 and therefore also in probability.

The hard part is to show that the convergence is also almost sure. First we consider an explicit subsequence on which it converges almost surely. We pick the sequence n_k = k^2. Then we have, by Chebyshev, P\left(\left|\frac{S_{k^2}}{k^2}- \mu\right| \ge \epsilon\right) \le \frac{1}{\epsilon^2} E\left[\left( \frac{S_{k^2}}{k^2}- \mu\right)^2\right] =\frac{\sigma^2}{k^2 \epsilon^2} and since \sum_{k} \frac{\sigma^2}{k^2 \epsilon^2} < \infty, by the Borel-Cantelli Lemma \frac{S_{k^2}}{k^2} converges to \mu almost surely.

For the next step, without loss of generality we assume that the RVs X_i are non-negative. If not, write X_i=X_{i+} - X_{i-} as the difference of positive and negative parts; the result for general X_i then follows from the result for non-negative ones.

Given n pick k=k(n) such that k^2 \le n \le (k+1)^2. Since all the terms are non-negative we have \frac{X_1 + \cdots + X_{k^2}}{k^2} \frac{k^2}{n} \le \frac{X_1 + \cdots + X_n}{n} \le \frac{(k+1)^2}{n} \frac{X_1 + \cdots + X_{(k+1)^2}}{(k+1)^2} Now as n \to \infty, k=k(n) \to \infty and since \lim_{k \to \infty} \frac{(k+1)^2}{k^2} =1 we get 1 \le \frac{n}{k^2} \le \frac{(k+1)^2}{k^2} \implies \lim_{n}\frac{n}{k^2} =1 \,. Similarly \lim_{n\to \infty } \frac{(k+1)^2}{n}=1, and since the outer averages converge to \mu along the subsequence k^2, this shows that \frac{S_n}{n} converges to \mu almost surely. \quad \square.

A stronger version exists in which only the existence of a finite mean is needed. The proof will be given later with the aid of martingales.

Theorem 12.2 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i]. Then we have \lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu \quad \left\{ \begin{array}{l} \text{almost surely}\\ \text{in probability} \end{array} \right.

We show next that the case \mu=+\infty can also be handled.

Theorem 12.3 Suppose X_1,X_2, \cdots are independent identically distributed non-negative random variables with E[X_i]=+\infty. Then we have \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = +\infty \quad \text{almost surely}

Proof. We use a truncation argument combined with the monotone convergence theorem. Pick R>0 and set Y_n = \min\{ X_n, R\} which is bounded and thus has finite variance. So by Theorem 12.2 we have, almost surely, with \mu_R =E[ \min\{ X_1, R\}], \lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R Since X_n \ge Y_n we have for any R \liminf_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} \ge \lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R But as R \nearrow \infty, \min\{ X_1, R\} \nearrow X_1 and thus by the monotone convergence theorem \mu_R =E[ \min\{ X_1, R\}]\nearrow E[X_1]=\infty. This concludes the proof. \quad \square.
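A sketch illustrating this theorem, under our own choice of distribution: X=U^{-2} with U uniform on (0,1) has tail P(X>x)=x^{-1/2} for x \ge 1, hence infinite mean, and the simulated averages blow up:

```python
import random

# X = U^{-2} with U uniform on (0,1) has P(X > x) = x^{-1/2} for x >= 1,
# hence E[X] = +infinity: the running averages S_n/n diverge.
random.seed(3)
S = 0.0
for n in range(1, 100001):
    u = 1.0 - random.random()   # in (0, 1], avoids division by zero
    S += u ** (-2)
    if n in (10, 100, 1000, 10000, 100000):
        print(f"n={n:6d}  S_n/n = {S / n:.1f}")
```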

12.2 Sample variance

Example: convergence of the sample variance Suppose X_1, X_2, \cdots are independent and identically distributed RVs; the sample variance is given by V_n =\sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2 and as proved in Exercise 8.5 E[V_n]= (n-1)\sigma^2.

We claim that \frac{V_n}{n} \to \sigma^2 almost surely. Indeed we have \frac{V_n}{n}= \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} \left( X_i^2 - 2 X_i \frac{S_n}{n} + \frac{S_n^2}{n^2} \right) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{S_n}{n}\right)^2 By the law of large numbers, Theorem 12.2, applied to the RVs X_1, X_2, \cdots and to the RVs X_1^2, X^2_2, \cdots (note that E[X_1^2]=\sigma^2 + \mu^2), and by continuity, Theorem 11.4, we have \frac{V_n}{n} \to \sigma^2 + \mu^2 - \mu^2 =\sigma^2\,.

Code
import random
import matplotlib.pyplot as plt

# Parameters for the exponential distribution
lambda_parameter = 0.5  # Adjust this to your desired rate parameter

# Initialize variables to track sample mean and sample variance
sample_mean = 0
sample_variance = 0
sample_size = 0

# Number of samples to collect
num_samples = 100000

# Lists to store data for plotting
sample_means = []
sample_variances = []

for _ in range(num_samples):
    # Generate a random sample from the exponential distribution
    sample = random.expovariate(lambda_parameter)
    
    # Update the sample size
    sample_size += 1
    
    # Update the sample mean incrementally (Welford's algorithm)
    delta = sample - sample_mean  # deviation from the previous mean
    sample_mean = sample_mean + delta / sample_size
    
    # Update the (unbiased) sample variance incrementally: the update uses the
    # deviations from both the old and the new mean
    if sample_size > 1:
        sample_variance = ((sample_size - 2) * sample_variance + delta * (sample - sample_mean)) / (sample_size - 1)
    
    # Append sample mean and variance to the lists for plotting
    sample_means.append(sample_mean)
    sample_variances.append(sample_variance)

print(f"Last 20 sample means ={sample_means[-20:]}")

print(f"Last 20 sample variances={sample_variances[-20:]}")

# Plot the sample mean and sample variance as a function of sample size
plt.figure(figsize=(8, 12))
plt.subplot(211)
plt.plot(range(1000, num_samples + 1), sample_means[999:])
plt.title("Sample Mean vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Mean")

plt.subplot(212)
plt.plot(range(1000, num_samples + 1), sample_variances[999:])
plt.title("Sample Variance vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Variance")

plt.tight_layout()
plt.show()
Last 20 sample means =[1.9916166329650253, 1.9916801012394052, 1.9916644813493183, 1.9916465502264151, 1.9916466571287432, 1.991634957874698, 1.9916157206710834, 1.9916031411378707, 1.9915889290911208, 1.9915985827755547, 1.9916474433120372, 1.9916280347825028, 1.9916134392592475, 1.9915983200218126, 1.9916064779928553, 1.9915870651156033, 1.9915852239662792, 1.9915700612986784, 1.9915806244533716, 1.991581218420978]
Last 20 sample variances=[3.920247574976993, 3.920611110700418, 3.9205962912359156, 3.9205892256577854, 3.9205500146342103, 3.92052448845379, 3.9205222795387136, 3.920498891626981, 3.920479877736396, 3.920449986961801, 3.920649489870106, 3.920647945557955, 3.920630037167405, 3.9206136856563227, 3.920581132016795, 3.9205796083411073, 3.920540739946253, 3.920524523321814, 3.9204964750930067, 3.920457305015601]

12.3 LLN proof of the Weierstrass approximation theorem

A classical result in analysis is that a continuous function f:[a,b] \to \mathbb{R} can be uniformly approximated by polynomials: for any \epsilon >0 there exists a polynomial p(x) such that \sup_{x \in [a,b]}|p(x) - f(x)|\le \epsilon. Many proofs of this result exist and we give one here based on the Law of Large Numbers, although the statement has nothing to do with probability. Without loss of generality, by rescaling, we can take [a,b]=[0,1], and we use polynomials naturally associated with binomial random variables, the Bernstein polynomials.

Theorem 12.4 Let f:[0,1]\to \mathbb{R} be a continuous function. Let f_n(x) be the Bernstein polynomial of degree n associated to f, given by f_n(x) = \sum_{k=0}^n {n \choose k} x^k(1-x)^{n-k} f(k/n)\,. Then we have \lim_{n\to \infty} \sup_{x \in [0,1]} |f(x)-f_n(x)| =0 \,.

Proof. If X_i are IID Bernoulli random variables with success probability p then S_n=X_1 + \cdots + X_n is a binomial RV and E\left[f\left(\frac{S_n}{n}\right)\right] = \sum_{k=0}^n {n \choose k} p^k(1-p)^{n-k} f(k/n) = f_n(p) Since \frac{S_n}{n} converges to p a.s. and in probability, f\left(\frac{S_n}{n}\right) converges to f(p) and thus, taking expectations, f_n(p) converges to f(p). We still need to work harder, though, to establish uniform convergence in p.

The variance of a binomial is n p(1-p) \le \frac{n}{4}, so the variance of \frac{S_n}{n} is bounded uniformly in p\in [0,1], and thus by Chebyshev P\left( \left|\frac{S_n}{n} - p\right| \ge \delta \right) \le \frac{p(1-p)}{n \delta^2} \le \frac{1}{4 n \delta^2} Since f is continuous on a compact interval, f is bounded with \sup_{x}|f(x)|=M < \infty and f is also uniformly continuous on [0,1]. Given \epsilon >0 pick \delta such that |x-y| < \delta \implies |f(x)-f(y)| < \epsilon. We have, for n large enough,
\begin{aligned} & |f_n(p)-f(p)|=\left| E\left[f\left(\frac{S_n}{n}\right)\right] - f(p) \right| \le E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right|\right] \\ &= E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| \ge \delta} \right] + E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| < \delta} \right] \le 2M \frac{1}{4 n \delta^2} + \epsilon \le 2\epsilon \end{aligned} This proves uniform convergence. \quad \square.
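The convergence can be checked numerically. The sketch below (our own choice of test function f(x)=|x-1/2|, continuous but not smooth) evaluates the Bernstein polynomials on a grid and prints the sup-norm error:

```python
from math import comb

# Bernstein approximation of f(x) = |x - 1/2| (continuous but not smooth):
# the sup-norm error on a grid shrinks as the degree n grows.
def bernstein(f, n, x):
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * f(k / n)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 200 for i in range(201)]
errs = {}
for n in (10, 50, 250):
    errs[n] = max(abs(f(x) - bernstein(f, n, x)) for x in grid)
    print(f"n={n:4d}  sup-error on grid = {errs[n]:.4f}")
```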

12.4 Empirical density and Glivenko-Cantelli

Example: sample CDF and empirical measure Suppose X_1, X_2, \cdots are independent and identically distributed RVs with common CDF F(t). Then F_n(t)=\frac{\# \{ i \in \{1,\cdots,n\}: X_i \le t\}}{n} \to F(t) \quad \text{ almost surely}\,. \tag{12.1} F_n(t)=F_n(t,\omega) is called the empirical CDF. Note that F_n(t,\omega) is the (random) CDF of the discrete random variable with distribution \frac{1}{n} \sum_{i=1}^n\delta_{X_i(\omega)}
which is called the empirical distribution. The convergence of F_n(t) to F(t) is just the law of large numbers applied to Y_n=1_{X_n \le t}, whose mean is E[Y_n]=P(X_n \le t)=F(t).

A strengthening of the law of large numbers is that the empirical CDF F_n(t) converges to F(t) uniformly in t.

Theorem 12.5 (Glivenko-Cantelli Theorem) For a RV X with CDF F(t) we have \sup_{t \in \mathbb{R}}|F_n(t)-F(t)| \text{ converges to } 0 \quad \text{almost surely}.

Proof. We only prove the case where F(t) is continuous, but the proof can be generalized to a general CDF by considering the jumps more carefully. The proof relies on the fact that F is increasing, which precludes oscillations and controls the convergence.

First we show that we can pick a set of probability 1 such that the convergence occurs for all t \in \mathbb{R} on that set. Since a countable union of sets of probability 0 has probability 0, we can pick a set \Omega_0 of probability 1 such that F_n(t,\omega) converges to F(t) for all rational t and all \omega \in \Omega_0. For x \in \mathbb{R} and rational s,t with s \le x \le t we have F_n(s,\omega) \le F_n(x,\omega) \le F_n(t,\omega) and therefore F(s)\le \liminf_{n}F_n(x,\omega) \le \limsup_{n}F_n(x,\omega) \le F(t) Since F(t) \searrow F(x) as t \searrow x and F(s) \nearrow F(x) as s \nearrow x we conclude that F_n(x,\omega) \to F(x) for all x \in \mathbb{R} and all \omega \in \Omega_0.

We show next that, for any \omega \in \Omega_0, the convergence is uniform in t. Since F is increasing and bounded, given \epsilon>0 we can find -\infty = t_0 < t_1 < \cdots < t_m=+\infty such that F(t_j)-F(t_{j-1}) \le \frac{\epsilon}{2}. Using that F_n and F are increasing we have for t \in [t_{j-1},t_j] \begin{aligned} F_n(t)-F(t) &\le F_n(t_j) - F(t_{j-1}) \le F_n(t_j)- F(t_j) + \frac{\epsilon}{2}\\ F_n(t)-F(t) &\ge F_n(t_{j-1}) - F(t_{j}) \ge F_n(t_{j-1})- F(t_{j-1}) - \frac{\epsilon}{2}\\ \end{aligned} We can now pick N_j=N_j(\omega) such that |F_n(t_j)- F(t_j)|\le \frac{\epsilon}{2} if n \ge N_j, and therefore if n\ge N=\max_j N_j we have |F_n(t,\omega)-F(t)| \le \epsilon for all t \in \mathbb{R}. This shows that \sup_{t\in \mathbb{R}}|F_n(t)-F(t)| converges almost surely to 0. \quad \square.

Illustration of the Glivenko-Cantelli Theorem (made with ChatGPT)

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Generate random data from a Beta distribution
np.random.seed(42)
true_distribution = beta.rvs(2, 5, size=1000)

# Generate empirical distribution function
def empirical_distribution(data, x):
    return np.sum(data <= x) / len(data)

# Compute empirical distribution function for different sample sizes
sample_sizes = [10, 100, 1000]
x_values = np.linspace(0, 1, 1000)

plt.figure(figsize=(12, 8))

for n in sample_sizes:
    # Generate a random sample of size n
    sample = np.random.choice(true_distribution, size=n, replace=True)
    
    # Calculate empirical distribution function values
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    
    # Plot the empirical distribution function
    plt.plot(x_values, edf_values, label=f'n={n}')

# Plot the true distribution function
true_cdf_values = beta.cdf(x_values, 2, 5)
plt.plot(x_values, true_cdf_values, linestyle='--', color='black', label='True Distribution')

# Plot the difference between ECDF and true CDF
for n in sample_sizes:
    sample = np.random.choice(true_distribution, size=n, replace=True)
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    difference = np.abs(edf_values - true_cdf_values)
    plt.plot(x_values, difference, linestyle='dotted', label=f'n={n} (Difference)')

plt.title('Glivenko-Cantelli Theorem with Beta Distribution')
plt.xlabel('x')
plt.ylabel('Empirical Distribution Function / Difference')
plt.legend()
plt.show()

12.5 The Monte-Carlo method

  • The (simple) Monte-Carlo method is a probabilistic algorithm which uses sums of independent random variables and the law of large numbers to estimate a (deterministic) quantity \mu \in \mathbb{R} (or \mathbb{R}^d).
    The basic idea is to express \mu as the expectation of some random variable, \mu = E [ h(X) ], and then use the law of large numbers to build an estimator for \mu.

Theorem 12.6 (Simple Monte-Carlo Sampling Algorithm) To compute \mu \in \mathbb{R}

  • Find a random variable h(X) such that \mu= E[h(X)].

  • I_n = \frac{1}{n} \sum_{k=1}^n h(X_k), where X_k are IID copies of X, is an unbiased estimator for \mu, that is we have

    • For all n we have E[I_n] = \mu (unbiased).

    • \lim_{n \to \infty} I_n = \mu almost surely and in probability.

  • An interesting aspect is that there are, in general, many ways to choose the random variable h(X).

  • Conversely, in many problems the random variable h(X) is given but the expectation is too difficult to compute, so we rely on the LLN to compute \mu.
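A minimal sketch of the simple Monte-Carlo algorithm, with the (arbitrary) choice h=\cos and X \sim N(0,1), for which \mu=E[\cos X]=e^{-1/2} is known exactly (the real part of the characteristic function at t=1):

```python
import math, random

# Simple Monte-Carlo for mu = E[h(X)] with h = cos, X ~ N(0,1) (a toy choice):
# here mu = exp(-1/2) exactly, so we can check the estimator I_n directly.
random.seed(4)
n = 200000
I_n = sum(math.cos(random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(f"I_n = {I_n:.4f}   exact mu = {math.exp(-0.5):.4f}")
```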

12.6 Computing \pi with Buffon’s needles

  • This seems to be the first example of rejection sampling used to solve a mathematical problem, due to Le Comte de Buffon (see bio on Wikipedia).

  • A needle of length l is thrown at random on a floor made of floorboards of width d, and we assume l \le d. We want to compute the probability that the needle intersects the line between two floorboards.

  • Denote by X the distance from the center of the needle to the nearest line (this is uniformly distributed on [0,\frac{d}{2}]) and by \Theta the acute angle between the needle and a horizontal line (this is uniformly distributed on [0,\frac{\pi}{2}]).

  • For the needle to intersect we must have X \le \frac{l}{2} \sin(\Theta) and thus P\left(X \le \frac{l}{2} \sin(\Theta)\right) = \int_0^{\frac{\pi}{2}} \int_0^{\frac{l}{2}\sin(\theta)} \frac{2}{d} dx \, \frac{2}{\pi} d\theta = \frac{2l}{d\pi}

  • So in order to estimate \pi you throw n needles on the floor at random and
    \pi \approx \frac{2ln}{d} \frac{1}{ \# \textrm { of needles intersecting two floor boards}} No random number generator needed….

Buffon’s needles
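Ironically, simulating Buffon's experiment on a computer does use a random number generator. A minimal sketch with l=d=1 (our parameter choice), counting the hits X \le \frac{l}{2}\sin\Theta:

```python
import math, random

# Simulated Buffon experiment with l = d = 1: a needle crosses a line iff
# X <= (l/2) sin(Theta), which happens with probability 2l/(d pi).
random.seed(5)
l = d = 1.0
n = 200000
hits = sum(
    1 for _ in range(n)
    if random.uniform(0.0, d / 2) <= (l / 2) * math.sin(random.uniform(0.0, math.pi / 2))
)
pi_estimate = 2 * l * n / (d * hits)
print(f"estimate of pi from {n} needles: {pi_estimate:.3f}")
```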

12.7 Computing \pi with random numbers

  • Enclose a disk of radius 1 in a square of side length 2 and consider the following Bernoulli random variable X.

    • Generate 2 independent random numbers V_1,V_2 uniformly distributed on [-1,1].

    • If V_1^2 + V_2^2 \le 1, set X=1, otherwise set X=0.

X is Bernoulli with p=\frac{\textrm{area of the disk}}{\textrm{ area of the square}}=\frac{\pi}{4}

So \frac{4}{n}\sum_{i=1}^n X_i \to \pi with probability 1

Code
import numpy as np
N = 1000000

dart_positions = 2 * np.random.rand(N, 2) - 1  # generates numbers in [-1, 1]
Ncircle = [0] # start the count with 0 to make our loop-logic easier

for x,y in dart_positions:
    if  x**2 + y**2 < 1:
        Ncircle.append(Ncircle[-1] + 1) # another dart has fallen in the circle
    else:
        Ncircle.append(Ncircle[-1])  # the dart fell outside of the circle 

running_estimate = []
for n_total, n_circle in enumerate(Ncircle[1:]): # skip the initial 0-count
    # n_total starts at 0, so we need to add 1
    running_estimate.append(4 * n_circle / (n_total + 1))

np.transpose(running_estimate[-20:])
array([3.14242771, 3.14242856, 3.14242542, 3.14242628, 3.14242714,
       3.14242799, 3.14242885, 3.14242971, 3.14243057, 3.14243142,
       3.14243228, 3.14242914, 3.14243   , 3.14243085, 3.14243171,
       3.14243257, 3.14243343, 3.14243428, 3.14243114, 3.142432  ])

12.8 Computing integrals

  • The goal is to compute for example \int_a^b h(x) dx. Without loss of generality, by rescaling space and replacing h by ch +d, we can assume that [a,b]=[0,1] and 0 \le h \le 1. For example take h(x)=\sqrt{1 - x^2} and we recover the previous example.

  • Monte-Carlo version I:
    Pick independent random numbers (U_1,U_2) on [0,1]\times [0,1]. Define a Bernoulli RV X by X = \left\{ \begin{array}{cl} 1 & \text{ if } U_2 \le h(U_1) \\ 0 & \text{ else } \end{array} \right. Then E[X]= P(U_2 \le h(U_1)) = \int_0^1 \int_0^1 1_{\{x_2 \le h(x_1)\}} dx_2 dx_1 = \int_0^1 \int_0^{h(x_1)} dx_2= \int_0^1 h(x) dx and so for independent X_i, \frac{1}{n}\sum_{i=1}^n X_i \to \int_0^1 h(x) dx almost surely.

  • Monte-Carlo version II:
    Pick U uniform on [0,1]; then E[h(U)]=\int_0^1 h(x) dx and so for independent U_i, \frac{1}{n}\sum_{i=1}^n h(U_i) \to \int_0^1 h(x) dx almost surely.

  • Monte-Carlo version III:

    Pick V non-uniform on [0,1] with density f, assumed positive wherever h is (e.g. a Beta RV with parameters \alpha,\beta). Then we have \int_0^1 h(x) dx = \int_{0}^1 \frac{h(x)}{f(x)} f(x) dx and so if the V_i are IID with density f then \frac{1}{n} \sum_{i=1}^n \frac{h(V_i)}{f(V_i)} converges to E\left[\frac{h(V)}{f(V)}\right]=\int_0^1 h(x) dx.

    This is the idea behind importance sampling: you want to sample more points from regions where h is large and which contribute more to the integral.

  • We can generalize this to integrals over subsets of \mathbb{R}^n or over the whole space.
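Versions II and III can be compared on a deliberately rigged example of our own: h(x)=3x^2, whose integral is 1. Taking f to be the Beta(3,1) density f(x)=3x^2 makes h/f \equiv 1, a zero-variance importance sampler. The sketch uses the standard library's `random.betavariate`:

```python
import random

# Versions II and III for h(x) = 3x^2 (integral = 1).  For version III we take
# f = density of Beta(3,1), i.e. f(x) = 3x^2, so h/f is identically 1 and the
# importance-sampling estimator has zero variance.
random.seed(6)
h = lambda x: 3 * x * x
n = 100000

# Version II: plain Monte-Carlo with uniform samples.
est2 = sum(h(random.random()) for _ in range(n)) / n

# Version III: importance sampling with V ~ Beta(3,1).
f = lambda x: 3 * x * x
est3 = sum(h(v) / f(v) for v in (random.betavariate(3, 1) for _ in range(n))) / n

print(f"version II:  {est2:.4f}")
print(f"version III: {est3:.4f}")
```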

12.9 Quantitative version of the law of large numbers

  • How many samples should we generate to obtain a given precision for the computation of \mu = E[X]?

  • In Monte-Carlo methods \mu itself is unknown, so we would like to make a prediction about \mu given the sample mean \frac{S_n}{n}=\frac{X_1 + \cdots + X_n}{n}.

  • Convergence in probability is the way to go: we want to estimate

P\left(\left|\frac{S_n}{n} -\mu \right| \le \epsilon\right) = P\left( \mu \in \left[\frac{S_n}{n} -\varepsilon, \frac{S_n}{n} +\varepsilon\right] \right) \ge 1 - \delta which gives a confidence interval for \mu in terms of the sample size n and the tolerance \epsilon. For example if \delta=0.01 and we have n samples, we can predict a tolerance for \mu with 99\% confidence.

  • If we know the variance we could use Chebyshev: P\left( \mu \in \left[\frac{S_n}{n} -\varepsilon, \frac{S_n}{n} +\varepsilon\right] \right) \ge 1- \frac{\sigma^2}{n \epsilon^2}. But this requires knowledge of the variance and could be overly pessimistic if X has finite higher moments.

  • Even better, we could use Chernoff bounds which give exponential (in n) bounds: P\left(\frac{S_n}{n} -\mu \ge \epsilon \right) =P(S_n \ge n(\mu+\epsilon)) \le \inf_{t \ge 0} \frac{E[e^{tS_n}]}{e^{tn(\mu+\epsilon)}} = e^{- n \sup_{t \ge 0} \left\{t(\mu+\epsilon)- \ln M(t) \right\} } and a similar bound for P\left(\frac{S_n}{n} -\mu \le - \epsilon \right).
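A back-of-the-envelope comparison of the two approaches, with the arbitrary choices \sigma=1, \epsilon=0.01, \delta=0.01 and using the Gaussian Chernoff rate e^{-n\epsilon^2/2\sigma^2} from the example above (doubled for the two-sided bound):

```python
import math

# Sample sizes guaranteeing P(|S_n/n - mu| >= eps) <= delta with sigma = 1,
# eps = delta = 0.01: Chebyshev needs sigma^2/(n eps^2) <= delta, while the
# two-sided Gaussian Chernoff bound needs 2 exp(-n eps^2/(2 sigma^2)) <= delta.
sigma, eps, delta = 1.0, 0.01, 0.01
n_cheb = math.ceil(sigma**2 / (delta * eps**2))
n_chern = math.ceil(2 * sigma**2 / eps**2 * math.log(2 / delta))
print(f"Chebyshev: n >= {n_cheb}")
print(f"Chernoff:  n >= {n_chern}")
```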

12.10 Hoeffding’s bound

  • Chernoff bounds are very sharp but require knowledge of the mgf M_X(t).
  • One of the main ideas behind concentration inequalities: given X, bound M_X(t) \le M(t) by the mgf M(t) of a random variable Y which you know explicitly.
    Mostly here we take Y Gaussian but one can also use other ones: Bernoulli, Poisson, Gamma, etc.
  • The following elementary bound will be used repeatedly.

Lemma 12.1 (Hoeffding’s bound) Suppose a \le X \le b with probability 1. Then

  1. Bound on the variance \displaystyle \textrm{Var}(X) \le \frac{(b-a)^2}{4}

  2. Bound on the mgf \displaystyle E\left[ e^{tX} \right] \le e^{t E[X]} e^{\frac{t^2 (b-a)^2}{8}}

Proof. For the bound on the variance, a \le X \le b implies that \displaystyle -\frac{b-a}{2} \le X - \frac{a+b}{2} \le \frac{b-a}{2} and therefore \textrm{Var}(X) = \textrm{Var}\left(X - \frac{a+b}{2}\right) \le E\left[\left(X - \frac{a+b}{2}\right)^2\right] \le \frac{(b-a)^2}{4}\,.

Since X is bounded the moment generating function M(t) = E[e^{tX}] exists for any t \in \mathbb{R}. To bound M(t) let us consider instead its logarithm u(t)=\ln M(t). We have \begin{aligned} u'(t) &= \frac{M'(t)}{M(t)} &=& E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right] \\ u''(t) &= \frac{M''(t)}{M(t)}- \left(\frac{M'(t)}{M(t)}\right)^2 &=& E\left[ X^2 \frac{e^{tX}}{E[e^{tX}]} \right] - E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right]^{2} \end{aligned} We recognize u''(t) as the variance under the tilted measure Q_t with density \frac{e^{tX}}{E[e^{tX}]}, which is defined by E_{Q_t}[ \cdot] = E\left[ \cdot \frac{e^{tX}}{E[e^{tX}]} \right], and thus by part 1. (applied to Q_t) we have u''(t) \le \frac{(b-a)^2}{4}.
Using the Taylor expansion with remainder we have, for some \xi between 0 and t \ln M(t)= u(t) = u(0) + u'(0)t + u''(\xi) \frac{t^2}{2} \le t E[X] + \frac{t^2(b-a)^2}{8}\,. This concludes the proof.

Remark: The bound on the variance in 1. is optimal. Indeed, taking without loss of generality a=0 and b=1, the variance is bounded by 1/4 and this is realized by taking X to be a Bernoulli with p=\frac12. This bound says that the RV with the largest variance is the one where the mass is concentrated at the endpoints.

The bound in 2. is optimal only in the sense that it is the best quadratic bound on u(t). For example, for a Bernoulli with a=0 and b=1 we have u(t)=\ln M(t)=\ln \left(\frac{1}{2}e^{t} + \frac{1}{2}\right)= \frac{1}{2}t + \ln \cosh\left(\frac{t}{2}\right) which is much smaller (for large t) than the quadratic bound \frac{t}{2} + \frac{t^2}{8}. There is room for better bounds but using a Gaussian is computationally convenient.
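As a quick numerical sanity check, we can compare u(t)=\ln M(t) for this Bernoulli example with the quadratic bound of Lemma 12.1. A minimal Python sketch (the helper names u and hoeffding_bound are ours):

```python
import math

# Bernoulli(1/2) on {0,1}: u(t) = ln M(t) = ln((1 + e^t)/2).
# Hoeffding's bound (Lemma 12.1, part 2) gives
# u(t) <= t*E[X] + t^2*(b-a)^2/8 = t/2 + t^2/8.

def u(t):
    return math.log((1.0 + math.exp(t)) / 2.0)

def hoeffding_bound(t):
    return t / 2.0 + t**2 / 8.0

for t in [0.1, 1.0, 2.0, 5.0]:
    print(f"t={t}: u(t)={u(t):.4f} <= bound={hoeffding_bound(t):.4f}")
```

For small t the two are nearly equal (the bound is sharp to second order at 0), while for large t the bound grows quadratically but u(t) only linearly.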

If we apply Hoeffding’s bound to a sum of independent random variables we find

Theorem 12.7 (Hoeffding’s Theorem) Suppose X_1, \cdots, X_n are independent random variables such that a_i \le X_i \le b_i (almost surely). Then \begin{align} P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \ge \varepsilon \right) &\le e^{- \frac{2 \varepsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}}\\ P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \le -\varepsilon \right) &\le e^{- \frac{2 \varepsilon^2}{ \sum_{i=1}^n(b_i-a_i)^2}} \end{align} \tag{12.2}

Proof. Using independence and Hoeffding’s bound we have E\left[e^{t (X_1 + \cdots + X_n - E[X_1 + \cdots + X_n])}\right] =\prod_{i=1}^n E\left[e^{t(X_i-E[X_i])}\right] \le \prod_{i=1}^n e^{\frac{t^2 (b_i-a_i)^2}{8}} = e^{ \frac{t^2 \sum_{i}(b_i-a_i)^2}{8}} and using the Chernoff bound (for a Gaussian RV with variance \displaystyle \sum_{i}\frac{(b_i-a_i)^2}{4}) gives the first bound in Equation 12.2.
The second bound is proved similarly. \quad \square.

We obtain from this bound a confidence interval for the empirical mean.

Corollary 12.1 (non-asymptotic confidence interval) Suppose X_1, \cdots, X_n are independent random variables such that a \le X_i \le b (almost surely) and \mu=E[X_i]. Then P\left( \mu \in \left[ \frac{S_n}{n}-\varepsilon, \frac{S_n}{n} +\varepsilon \right] \right) \ge 1- 2 e^{-\frac{2 n \varepsilon^2}{(b-a)^2}}

Proof. This is Hoeffding’s bound with \varepsilon replaced by n\varepsilon and with \sum_{i=1}^n (b_i-a_i)^2 = n(b-a)^2. \quad \square.

Using

\delta = 2 e^{-\frac{2 n \varepsilon^2}{(b-a)^2}} \iff \epsilon = \sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2n}} we get the confidence interval P\left( \mu \in \left[ \frac{S_n}{n}- \sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2n}}, \frac{S_n}{n} +\sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2 n}} \right] \right) \ge 1 -\delta
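The half-width \epsilon can be computed directly from this formula; a minimal Python sketch (the function name hoeffding_epsilon is ours):

```python
import math

def hoeffding_epsilon(n, delta, a=0.0, b=1.0):
    # Half-width of the level-(1-delta) Hoeffding confidence interval
    # for the mean of n IID samples taking values in [a, b].
    return math.sqrt((b - a)**2 * math.log(2.0 / delta) / (2.0 * n))

# tolerance for 99% confidence with samples in [0, 1]
for n in [100, 1000, 10000]:
    print(f"n={n}: epsilon = {hoeffding_epsilon(n, 0.01):.4f}")
```

For samples in [0,1] and 99\% confidence, n=100 gives \epsilon \approx 0.163 while n=10000 gives \epsilon \approx 0.016: the width shrinks like 1/\sqrt{n}.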

12.11 Bernstein bound

In Hoeffding’s bound we use, in an essential way, the worst-case bound \frac{(b-a)^2}{4} on the variance. If the actual variance is much smaller, one should expect the bound to be poor. The Bernstein bound, which can be used if we have some a-priori knowledge of the variance, improves on it.

Theorem 12.8 (Bernstein Bound) Suppose X is a random variable such that |X-E[X]| \le c and \textrm{Var}(X) \le \sigma^2. Then for |t| < 3/c, E[e^{t X}]\le e^{t E[X]+ \frac{\sigma^2 t^2}{2(1 -c|t|/3)}} \,.

Proof. We expand the exponential and use that for k\ge 2, with \mu=E[X], E\left[(X-\mu)^k\right] \le E\left[(X-\mu)^2|X-\mu|^{k-2}\right]\le E[(X-\mu)^2] c^{k-2} \le \sigma^2 c^{k-2} and get \begin{aligned} E\left[e^{t(X-\mu)}\right]=1+ \sum_{k=2}^\infty \frac{t^k}{k!} E[(X-\mu)^k] &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty\frac{2}{k!}(|t|c)^{k-2} \\ &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty \left(\frac{|t|c}{3}\right)^{k-2} \quad \textrm{ since } \frac{k!}{2}\ge 3^{k-2}\\ &\le 1 + \frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})} \le e^{\frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})}} \quad \textrm{ since } 1+ x \le e^x \quad \square \end{aligned}

To combine this with a Chernoff bound we have to solve an optimization problem, which after some straightforward but lengthy computation gives \sup_{t \ge 0} \left\{ \varepsilon t - \frac{a t^2}{2(1-bt)} \right\} = \frac{a}{b^2}h\left( \frac{b\epsilon}{a}\right) \quad \textrm{ where } h(u) = 1 +u - \sqrt{1 + 2u}\,. Note that we can invert the function h: h^{-1}(z)= z + \sqrt{2z}. This makes the Bernstein bound especially convenient for obtaining explicit formulas. By symmetry we find the same bound for the left tail.

Theorem 12.9 (Bernstein for sums of IID) If X_1, \cdots, X_n are IID random variables with \mu=E[X_i], |X_i-\mu| \le c and \textrm{Var}(X_i)\le \sigma^2 then

P\left( \mu \in \left[\frac{S_n}{n} - \frac{c}{3n} \ln\left( \frac{2}{\delta}\right) - \sqrt{\frac{2\sigma^2}{n} \ln\left( \frac{2}{\delta}\right)}, \frac{S_n}{n} + \frac{c}{3n} \ln\left( \frac{2}{\delta}\right) + \sqrt{\frac{2\sigma^2}{n} \ln\left( \frac{2}{\delta}\right)} \right] \right) \ge 1 - \delta

Proof. \begin{aligned} P\left(\frac{S_n}{n} - \mu \ge \varepsilon \right) & = P \left( X_1 + \cdots + X_n \ge n(\mu+ \varepsilon) \right) \le e^{-n \sup_{t \ge 0} \{ \varepsilon t - \frac{\sigma^2 t^2}{2(1 -ct/3)} \}} \end{aligned} and we obtain the same bound for P\left(\frac{S_n}{n} - \mu \le -\varepsilon \right).

To obtain a confidence interval we need to solve \delta = 2 e^{-n\frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right)} \iff \varepsilon = b \frac{1}{n} \ln\left( \frac{2}{\delta}\right) + \sqrt{2a \frac{1}{n} \ln\left( \frac{2}{\delta}\right)} and set a=\sigma^2 and b=c/3 to obtain the desired bound.

Comparison: For RVs with values in [0,1] (so that \sigma_{max}=\frac{1}{2} and we can take c=1): (\textrm{Bernstein}) \quad \frac{c}{3n} \ln\left( \frac{2}{\delta}\right) + \sqrt{2 \ln\left( \frac{2}{\delta}\right)} \frac{\sigma}{\sqrt{n}} \quad \textrm{ versus} \quad \sqrt{2\ln\left(\frac{2}{\delta}\right)} \frac{\sigma_{max}}{\sqrt{n}} \quad (\textrm{Hoeffding})

Note that this bound can be substantially better than Hoeffding’s bound provided \sigma is much smaller than \sigma_{max}: if n is large the 1/n term is much smaller than the 1/\sqrt{n} term and so the Bernstein bound becomes better. See the illustration below.

There is an even more sophisticated version of this bound where one uses the sample variance to build the bound. One then needs to estimate the probability that the sample variance is far from the true variance, which itself requires more sophisticated inequalities based on martingales.

As an illustration we plot the empirical mean as well as Hoeffding’s and Bernstein’s confidence intervals for estimating the mean of a beta RV: in Hoeffding’s bound the maximum variance is \frac{1}{4} and we can take c=1 in Bernstein’s. Here we take the parameters \alpha=1 and \beta=2 so the true mean is \frac{\alpha}{\alpha +\beta}=\frac{1}{3} and the true variance is \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha + \beta +1)} = \frac{1}{18}, which is quite a bit smaller than \frac{1}{4}.

confidence intervals with Bernstein and Hoeffding
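The two half-widths being compared can be recomputed directly from the formulas above; a small Python sketch (function names ours), using c=1 and the variance \sigma^2=\frac{1}{18} of the beta example:

```python
import math

def hoeffding_eps(n, delta, a=0.0, b=1.0):
    # half-width of Hoeffding's confidence interval
    return math.sqrt((b - a)**2 * math.log(2 / delta) / (2 * n))

def bernstein_eps(n, delta, sigma2, c=1.0):
    # half-width of Bernstein's confidence interval
    L = math.log(2 / delta)
    return c * L / (3 * n) + math.sqrt(2 * sigma2 * L / n)

delta = 0.05
for n in [100, 1000, 10000]:
    print(n, round(hoeffding_eps(n, delta), 4),
          round(bernstein_eps(n, delta, 1 / 18), 4))
```

For every n shown, Bernstein’s interval is roughly half as wide, reflecting \sigma^2=\frac{1}{18} versus \sigma_{max}^2=\frac{1}{4}.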

12.12 Exercises

Exercise 12.1 Suppose X_i are independent and identically distributed normal random variables with mean 1 and variance 3. Compute \lim_{n\to \infty}\frac{X_1+ X_2 + \cdots+ X_n}{X_1^2+ X_2^2 + \cdots+ X_n^2}

Exercise 12.2 Suppose X_i are independent and identically distributed with E[X_i]>0. Show that \lim_{n \to \infty}S_n= \lim_{n\to \infty} X_1+ \cdots + X_n = + \infty almost surely.

Exercise 12.3 Suppose X_i are independent and identically distributed and strictly positive (i.e. P(X_i>0)=1), with E[|\ln X_1|]<\infty. Show that, almost surely, \lim_{n \to \infty} (X_1 X_2 \cdots X_n)^{\frac{1}{n}}= \alpha where \alpha = e^{E[\ln X_1]}. This is a simple model used in financial applications where X_i describes the change of your investment on a given day: if you start with F_0 your fortune is F_0X_1 after the first day, F_0X_1X_2 after the second day, and so on.

13 Weak convergence aka convergence in distribution

13.1 Weak convergence of probability measures

In the notion of weak convergence we do not view random variables as maps X: \Omega \to \mathbb{R} but work exclusively with their distributions P^X (that is, probability measures on \mathbb{R}). In fact one can talk about weak convergence of random variables even if they do not live on the same probability space!

Definition 13.1 (Weak convergence of probability measures and convergence in distribution for random variables)  

  1. The sequence (P_n) of probability measures on \mathbb{R} converges weakly to P if \lim_{n \to \infty} \int f dP_n = \int f dP for any f: \mathbb{R} \to \mathbb{R} bounded and continuous.

  2. The sequence of RVs (X_n) converges to X in distribution if the distribution P^{X_n} of X_n converges weakly to the distribution P^X of X, i.e., \lim_{n \to \infty} E[f(X_n)] = E[f(X)] for any f: \mathbb{R} \to \mathbb{R} bounded and continuous.

Remark: We can generalize this easily to RVs taking values in \mathbb{R}^n or in a metric space (so that we can talk about bounded continuous functions). We stick with \mathbb{R} for simplicity.

13.2 Simple properties and some examples

Theorem 13.1 (weak means weak)  

  1. If X_n converges to X in L^p, in probability, or almost surely, then X_n converges to X in distribution.

  2. If X_n converges in distribution to a constant RV X=a then X_n converges to a in probability.

Proof. We show that convergence in probability implies convergence in distribution. Since almost sure convergence and convergence in L^p imply convergence in probability, this will prove 1. As we have proved in Theorem 11.4, if X_n converges to X in probability then the continuity of f implies that Y_n=f(X_n) converges to Y=f(X) in probability. Since f is bounded the random variables Y_n and Y are bounded (and so in every L^p). As we proved in Theorem 11.7 this implies that \lim_{n\to \infty} E[f(X_n)]=E[f(X)].

For the (partial) converse statement in 2., assume that X_n converges weakly to a constant X=a and consider the continuous function f(x)= \min\{|x-a|, 1 \}. Then we have E[ \min \{ |X_n-a|, 1 \} ] = E[f(X_n)] \to E[f(X)] = f(a)=0. By Theorem 11.6 this implies that X_n converges to a in probability.

Examples:

  1. If the X_i are identically distributed RVs (i.e. P^{X_n}=P for all n) then X_n converges in distribution, but X_n does not converge in any other sense unless X=X_1=X_2=\cdots.
    An example is when the X_i are IID Cauchy RVs with parameter \beta: then \frac{S_n}{n}=\frac{1}{n}(X_1 + \cdots+ X_n) is also a Cauchy RV with parameter \beta. Both X_n and \frac{S_n}{n} converge in distribution to a Cauchy RV with parameter \beta, but these sequences do not converge in any other sense.

  2. The measures P_n= \frac{1}{n} \sum_{i=1}^n \delta_{\frac{i}{n}} converge weakly to the Lebesgue measure on [0,1]. Indeed for f continuous, the Riemann sum \int f dP_n = \frac{1}{n}\sum_{i=1}^n f\left(\frac{i}{n}\right) \to \int_0^1 f(x) dx \,.

  3. If P_n = \delta_{x_n} then P_n converges weakly to P if and only if P=\delta_x and x_n \to x.

  4. Convergence of the quantile functions: Let Q_n be the quantile function for X_n and Q the quantile function for X and let P_0 be Lebesgue measure on [0,1]. Then P^{X_n}=P_0 \circ Q_n^{-1} and P^{X}=P_0 \circ Q^{-1}.
    If the quantiles Q_n converges to Q for P_0 almost all \omega \in [0,1] then for f continuous and bounded f\circ Q_n converges to f \circ Q for P_0 almost all \omega \in [0,1]. By the bounded convergence theorem
    E[f(X_n)]=\int_{\mathbb{R}} f(x) dP^{X_n}(x) = \int_{[0,1]} f( Q_n(\omega))dP_0(\omega) \to \int_{[0,1]} f( Q(\omega))dP_0(\omega) =E[f(X)] and thus X_n converges in distribution to X.
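Example 2 above is easy to check numerically, since \int f\,dP_n is just a Riemann sum. A minimal Python sketch (names ours) with the bounded continuous test function f(x)=\cos(\pi x), whose Lebesgue integral over [0,1] is 0:

```python
import math

# P_n = (1/n) * sum of point masses at i/n, i = 1..n, so integrating
# f against P_n gives a right-endpoint Riemann sum.
def integral_Pn(f, n):
    return sum(f(i / n) for i in range(1, n + 1)) / n

f = lambda x: math.cos(math.pi * x)
for n in [10, 100, 1000]:
    print(n, integral_Pn(f, n))
```

The values tend to 0 like 1/n (one can check that the sum is exactly -1/n here).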

13.3 Convergence of probabilities of sets

Basic question If P_n converges weakly to P and A\subset \mathbb{R} is some measurable set what does this imply for the convergence of P_n(A) = \int 1_{A} dP_n?

Notation If A is a set, then we denote by \overline{A} the closure of A, by \mathring{A} the interior of A, and by \partial A the boundary of A. We have \overline{A}= A \cup \partial A \quad \mathring{A}=A \setminus \partial A \quad \partial A = \overline{A} \setminus \mathring{A}

Theorem 13.2 (Weak convergence and set convergence) The following are equivalent

  1. P_n converges weakly to P.

  2. For any A closed we have \limsup_n P_n(A) \le P(A).

  3. For any A open we have \liminf_n P_n(A) \ge P(A).

  4. For any A with P(\partial A)=0 we have \lim_n P_n(A) = P(A).

We will prove 1. \implies 2. \iff 3. \implies 4. \implies 1.

Proof.

  • Assume 1. holds and A is a closed set. We consider the \epsilon-neighborhood of A: A_\epsilon = \{ x, d(x,A) < \epsilon \} \quad \quad \text{where}\quad d(x,A)=\inf \{ d(x,y)\,;\, y \in A\} Since A is closed, A_\epsilon \searrow A as \epsilon \searrow 0 and so by sequential continuity P(A_\epsilon) \searrow P(A).
    Consider now the function f(x) = \max\left\{ 1- \frac{d(x,A)}{\epsilon}, 0 \right\} which is bounded and continuous and satisfies 1_A \le f(x) \le 1_{A_\epsilon}. From this we see that P_n(A) \le \int f dP_n \quad \quad \text{ and } \quad \quad \int f dP \le P(A_\epsilon) Weak convergence means \int f dP_n \to \int f dP and thus \limsup_{n} P_n(A) \le P(A_\epsilon). Since this holds for all \epsilon we have proved 2.

    2. and 3. are equivalent since complements of closed sets are open and vice versa, and \liminf_n (1- r_n) = 1 -\limsup_n r_n for any sequence r_n in [0,1].
  • Assume 3. and then also 2. hold. Let A be a Borel set, since \overline{A} \supset A \supset \mathring{A} we have P(\overline{A})\ge \limsup_{n} P_n(\overline{A}) \ge \limsup_{n} P_n(A) \ge \liminf_{n} P_n(A) \ge \liminf_{n} P_n(\mathring{A}) \ge P(\mathring{A}) If P(\partial A)=0 then P(\overline{A})=P(\mathring{A}) and this implies that \lim_n P_n(A)=P(A).

  • Take f bounded and continuous so that a < f(x) < b and consider the probability measure P \circ f^{-1} on (a,b). Since atoms are countable we pick a partition a_0=a< a_1 < a_2 < \cdots < a_m=b such that a_i-a_{i-1} < \epsilon and a_i is not an atom for P \circ f^{-1}.
    Set A_i = f^{-1}((a_{i-1},a_i]) and define the simple functions g=\sum_{i} a_{i-1} 1_{A_i}\,, \quad\quad h=\sum_{i} a_{i} 1_{A_i} \,. By construction we have f- \epsilon \le g \le f \le h \le f + \epsilon. \tag{13.1} Since f is continuous, if x \in \partial A_i then f(x)=a_{i-1} or f(x)=a_i, which are not atoms. Therefore P(\partial A_i)=0 and thus, by 4., P_n(A_i) \to P(A_i) as n \to \infty. This implies that \int g dP_n \to \int g dP and \int h dP_n \to \int h dP and thus by Equation 13.1 we have \begin{aligned} \int f dP - \epsilon \le \int g dP = \lim_n \int g dP_n & \le \liminf_n \int f dP_n \\ &\le \limsup_n \int f dP_n \le \lim_n \int h dP_n =\int h dP \le \int f dP + \epsilon \end{aligned} Since \epsilon is arbitrary we conclude \lim_n \int f dP_n = \int f dP. \quad \square.

13.4 Uniqueness of limits

  • Suppose P and Q are two probability measures on \mathbb{R} and suppose that \int f dP= \int f dQ for all bounded continuous functions f. Then we claim that P=Q.

  • We can provide a simple proof using Theorem 13.2. Take P_n=Q; then P_n converges weakly to P and so for any open set A we have Q(A)=\liminf_n P_n(A) \ge P(A). Exchanging the roles of P and Q we get P(A)\ge Q(A), and thus P and Q coincide on all open sets; such sets form a \pi-system which generates the Borel \sigma-algebra.

  • As a consequence, limits in weak convergence are unique: if P_n converges weakly to both P and Q then P=Q.

13.5 Convergence of distribution and quantile functions

For random variables weak convergence is called convergence in distribution and the following theorem explains why.

Theorem 13.3 (Convergence in distribution and convergence of the distribution functions) The following are equivalent

  1. The random variables X_n converges to X in distribution.

  2. The CDFs F_{X_n}(t)\to F_X(t) at every continuity point of F_X.

  3. The quantile functions Q_{X_n}(z)\to Q_X(z) at every continuity point of Q_X.

Proof.

  • 1. \implies 2.: Suppose t is a continuity point of F_X; then t is not an atom for P^X. From Theorem 13.2 part 4. we see that F_{X_n}(t)=P^{X_n}((-\infty,t]) converges to P^{X}((-\infty,t])=F_X(t).

  • 2. \implies 3.: Suppose z is a point of continuity for Q and let t=Q(z). Fix \epsilon>0, and choose s \in (t-\epsilon,t) and r \in (t, t+\epsilon) to be continuity points of F. Since Q is continuous at z, F is not flat at level z and thus F(s) < z < F(r). Since F_n(s) \to F(s) by assumption we have F_n(s)< z and thus Q_n(z) \ge s > t -\epsilon for all but finitely many n. This means that \liminf_n Q_n(z) \ge t-\epsilon. A similar argument shows that \limsup_n Q_n(z) \le t+\epsilon. Thus \lim_{n}Q_n(z)=Q(z) and 3. holds.

  • 3. \implies 1.: The quantile function, being non-decreasing, has only countably many discontinuities and thus is continuous Lebesgue almost everywhere. As we have seen in the example in Section 13.2, this implies convergence in distribution.

Example If X_i are independent and identically distributed random variables with common CDF F(t) then the Glivenko-Cantelli theorem implies that for almost all \omega F_n(t) = \frac1n \sum_{k=1}^n 1_{\{X_k(\omega)\le t\}} \to F(t) \quad \text{ for all }t. That is, the empirical (random) measure \frac{1}{n}\sum_{k=1}^n \delta_{X_k(\omega)} converges weakly to P^X for almost all \omega. In this example the convergence occurs for all t even if F has discontinuities.
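This is easy to visualize by simulation; a minimal Python sketch (names and parameter choices ours) using exponential samples, where F(t)=1-e^{-t}:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.exponential(size=n)  # IID samples with CDF F(t) = 1 - exp(-t)

# empirical CDF F_n(t) evaluated at a few points t
for t in [0.5, 1.0, 2.0]:
    Fn = np.mean(x <= t)
    F = 1 - np.exp(-t)
    print(f"t={t}: F_n(t)={Fn:.4f}  F(t)={F:.4f}")
```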

Example Consider the random variables X_n with distribution function F_n(t) = \left\{ \begin{array}{cl} 0 & t \le -\frac1n \\ \frac12 + \frac{n}{2} t& -\frac1n < t < \frac1n \\ 1 & t \ge \frac1n \\ \end{array} \right. Then F_n(t) converges to 0 for t <0 and to 1 for t>0, while F_n(0)=\frac{1}{2} for all n. So F_n(t) converges to F(t)=1_{[0,\infty)}(t) at all continuity points of F. So a uniform random variable on [-\frac1n, \frac1n] converges weakly to the random variable X=0.

Example: extreme value and maxima of Pareto distribution Consider X_1, \cdots, X_n to be independent Pareto RVs each with cumulative distribution function F(t)= 1- \frac{1}{t^\alpha} for t \ge 1 and 0 for t \le 1. We are interested in the distribution of the maximum M_n = \max_{m \le n} X_m in the limit of large n. We have, by independence, P(M_n \le t)= P(X_1\le t, \cdots, X_n \le t)= F(t)^n = \left(1 - \frac{1}{t^\alpha}\right)^n This suggests the scaling t=n^{1/\alpha}y so that P(M_n \le n^{1/\alpha}y ) = \left(1 - \frac{y^{-\alpha}}{n}\right)^n \to e^{-y^{-\alpha}} \quad \text{ as } n \to \infty. Thus we proved that M_n/n^{1/\alpha} converges in distribution to a distribution which is called a Fréchet distribution.
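A simulation sketch of this limit (parameter choices ours): we sample Pareto variables by inverse transform, X = U^{-1/\alpha} for U uniform on (0,1), and compare the empirical distribution of M_n/n^{1/\alpha} with the Fréchet CDF e^{-y^{-\alpha}}:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n, m = 2.0, 2000, 5000  # tail index, sample size, repetitions

# Pareto samples via inverse transform: X = U^(-1/alpha), U uniform on (0,1)
u = rng.random((m, n))
maxima = (u ** (-1 / alpha)).max(axis=1) / n ** (1 / alpha)

# compare P(M_n / n^(1/alpha) <= y) with the Frechet CDF exp(-y^(-alpha))
for y in [0.5, 1.0, 2.0]:
    print(y, np.mean(maxima <= y), np.exp(-y ** (-alpha)))
```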

13.6 Convergence of densities

We show next that convergence of the densities f_n(x) implies convergence in distribution.

Theorem 13.4 Suppose X_n is a sequence of random variables with densities f_n(x) and f_n(x) converges Lebesgue almost everywhere to a density f(x). Then X_n converges in distribution to the random variable X with density f(x).

Proof. The cumulative distribution function F_n(t)=\int_{-\infty}^t f_n(x) dx is continuous for every t and we would like to show that F_n(t) converges to F(t)=\int_{-\infty}^t f(x) dx for every t. However we cannot use the dominated convergence theorem since there is no dominating function for the f_n.

We prove instead that for any h bounded and continuous we have \lim_{n} E[h(X_n)]=E[h(X)], using that f_n is non-negative and normalized. Since h is bounded we set \alpha=\sup_{x}|h(x)| and consider the two non-negative functions h_1(x) = h(x)+\alpha \ge 0, \quad h_2(x) =\alpha -h(x) \ge 0. We now apply Fatou’s Lemma to the sequences of non-negative functions h_1(x)f_n(x) and h_2(x)f_n(x). We have for i=1,2 E[ h_i(X)] = \int f(x) h_i(x) dx \le \liminf_{n} \int f_n(x) h_i(x) dx = \liminf_{n} E[h_i(X_n)]

From this we obtain E[h(X)] + \alpha \le \liminf_{n} E[h(X_n)] + \alpha \quad \text{ and } \quad \alpha - E[h(X)] \le \alpha - \limsup_{n} E[h(X_n)] and thus E[h(X)]= \lim_{n} E[h(X_n)]. \quad \square.


Example Suppose X_n is a sequence of normal random variables with mean \mu_n and variance \sigma_n^2. If \mu_n \to \mu and \sigma_n^2\to \sigma^2>0 then X_n converges to a normal random variable with mean \mu and variance \sigma^2. This follows from the fact that the density of a normal random variable is a continuous function of \mu and \sigma^2.

13.7 Some remarks on convergence in total variation

The convergence of densities implies in fact a stronger mode of convergence than weak convergence.

We have not used the continuity of h in the proof, and thus we in fact proved that \lim_{n} E[h(X_n)] = E[h(X)] \quad \text{ for all } h \text{ bounded and measurable.} In particular we can take h=1_A for any measurable set A and we have \lim_{n} P(X_n \in A) = P(X \in A) \quad \text{ for all measurable sets } A\,.

This convergence is much stronger than weak convergence and is called convergence in total variation.

13.8 Tightness and Prohorov Theorem

Basic question: We prove next a compactness result with respect to weak convergence: given a collection of probability measures P_n on \mathbb{R}, when can we expect to have the existence of a (weakly) convergent subsequence?

We first need a new concept.

Definition 13.2 (Tightness) A collection of probability measures \mathcal{P}=\{P_i\}_{i \in I} is tight if for any \epsilon>0 there exists R such that P_i( [-R,R] ) \ge 1 - \epsilon \quad \text{ for all } i \in I

The following theorem is (a version of) Prohorov’s theorem, which actually holds on spaces more general than \mathbb{R} (any complete separable metric space).

Theorem 13.5 (Prohorov theorem on \mathbb{R}) If a collection of probability measures \mathcal{P} is tight then any sequence of measures \{P_n\} with P_n \in \mathcal{P} has a subsequence which converges weakly to some probability measure P.

Proof. The proof uses the cumulative distribution functions F_n(t)=P_n((-\infty,t]). Since for any t \in \mathbb{R} we have 0 \le F_n(t)\le 1, by Bolzano-Weierstrass there exists a subsequence F_{n_k} such that F_{n_k}(t) converges. Of course the subsequence n_k will depend a priori on t.

To construct the limit for every t we use a diagonal sequence argument: consider an enumeration r_1, r_2, \cdots of the rationals \mathbb{Q}.

  • For r_1 there exists a subsequence n_{1,k} such that the limit exists and we set G(r_1)=\lim_{k \to \infty} F_{n_{1,k}}(r_1) \,.

  • For r_2 there exists a sub-subsequence n_{2,k} of n_{1,k} such that the limit exists and we set G(r_2)=\lim_{k \to \infty} F_{n_{2,k}}(r_2) \,. and note that we also have G(r_1)=\lim_{k \to \infty} F_{n_{2,k}}(r_1).

  • Continuing in this way, for r_j we have a subsequence n_{j,k} and we set G(r_j)=\lim_{k \to \infty} F_{n_{j,k}}(r_j)
    and note that F_{n_{j,k}}(r) converges for r=r_1,r_2, \cdots, r_j.

  • Finally consider the diagonal sequence n_k=n_{k,k}; we have for all j, G(r_j)= \lim_{k \to \infty} F_{n_k}(r_j), since (n_k)_{k \ge j} is a subsequence of n_{j,k}.

We now define a function F on \mathbb{R} by setting
F(t) = \inf \left\{ G(r), r \ge t \text{ rational} \right\} Since G is non-decreasing, F is also non-decreasing and it is right continuous by construction.

We now use the tightness hypothesis and choose R so that P_n([-R,R]) \ge 1-\epsilon for all n simultaneously. This implies that F_n(t)\le \epsilon \text{ for } t \le -R \quad \text{ and } \quad F_n(t)\ge 1- \epsilon \text{ for } t \ge R \,. The same holds for the function G and finally also for the function F, and we have F(t)\le \epsilon \text{ for } t \le -R \quad \text{ and } \quad F(t)\ge 1- \epsilon \text{ for } t \ge R \,. Since 0 \le F \le 1, F is right-continuous and non-decreasing, and \epsilon is arbitrary, this shows that F(t) \to 0 as t\to -\infty and F(t) \to 1 as t\to +\infty, so F is the cumulative distribution function of some probability measure P.

To conclude we need to prove that F_{n_k}(t) converges to F(t) at all continuity points of F. Assuming that F(t_-)=F(t), there exist rationals r < t < s such that F(t)-\epsilon < G(r) \le F(t) \le G(s) \le F(t) + \epsilon If k is large enough we have F(t)-2\epsilon < F_{n_k}(r) \le F_{n_k}(t) \le F_{n_k}(s) \le F(t) + 2\epsilon

Thus F(t)-2\epsilon \le G(r) \le \liminf_{k} F_{n_k}(t) \le \limsup_{k} F_{n_k}(t) \le G(s) \le F(t) + 2\epsilon and since \epsilon is arbitrary \lim_{k}F_{n_k}(t) exists and equals F(t). By Theorem 13.3 this shows that P_{n_k} converges weakly to P.

13.9 Weak convergence and characteristic function

A fundamental result to prove the central limit theorem is the following

Theorem 13.6 (Lévy continuity theorem) Let P_n be a sequence of probability measures on \mathbb{R} and \widehat{P}_n(t) their Fourier transforms.

  1. P_n converges weakly to P implies that \widehat{P}_n(t) converges pointwise to \widehat{P}(t)

  2. If \widehat{P}_n(t) converges pointwise to a function h(t) which is continuous at 0 then h(t)=\widehat{P}(t) is the Fourier transform of a measure P and P_n converges weakly to P.

Proof. For 1. just note that e^{itx} is bounded and continuous, and thus weak convergence implies convergence of the characteristic functions for every t.

For 2. we show first that if \widehat{P}_n(t) converges to a function h(t) which is continuous at 0 then the sequence P_n is tight. By Fubini theorem \begin{aligned} \int_{-\alpha}^\alpha\widehat{P}_n(t) dt &= \int_{-\infty}^\infty \int_{-\alpha}^\alpha e^{itx} dt dP_n(x) = \int_{-\infty}^\infty \int_{-\alpha}^\alpha \cos(tx) dt dP_n(x) = \int_{-\infty}^\infty \frac{2}{x} \sin(\alpha x) dP_n(x) \end{aligned}

Next using the following easy bound 2\left(1-\frac{\sin(v)}{v}\right) \left\{ \begin{array}{cl} \ge 1 & \text{ if } |v| \ge 2 \\ \ge 0 & \text{ always } \end{array} \right. we obtain \begin{aligned} \frac{1}{\alpha}\int_{-\alpha}^\alpha(1-\widehat{P}_n(t)) dt & = \int_{-\infty}^\infty 2\left(1- \frac{\sin(\alpha x)}{\alpha x}\right) dP_n(x) \ge \int_{\alpha|x|\ge 2} dP_n(x) = P_n\left( \left[-\frac{2}{\alpha},\frac{2}{\alpha}\right]^c\right) \end{aligned} \tag{13.2}

Now since \widehat{P}_n(0)=1 for all n we have h(0)=1 and so by the assumed continuity of h we can choose \alpha sufficiently small so that \frac{1}{\alpha}\int_{-\alpha}^\alpha |1-h(t)|dt \le \frac{\epsilon}{2} \tag{13.3} By the bounded convergence theorem we have \lim_{n \to \infty} \frac{1}{\alpha}\int_{-\alpha}^\alpha |1-\widehat{P}_n(t)|dt = \frac{1}{\alpha}\int_{-\alpha}^\alpha |1-h(t)|dt and so combining Equation 13.2 with Equation 13.3 we find that for n large enough
P_n\left( \left[-\frac{2}{\alpha},\frac{2}{\alpha}\right]^c\right) \le \epsilon \,. This ensures that the sequence \{P_n\} is tight.

To conclude we invoke Theorem 13.5: for any subsequence P_{n_k} there exists a sub-subsequence P_{n_{k_j}} which converges weakly to some probability measure P. By part 1. this implies that \widehat{P}_{n_{k_j}}(t) converges to \widehat{P}(t), which must then be equal to h(t). This shows that h(t) is the characteristic function of the probability measure P and that the limit is the same for any choice of subsequence n_k. This implies that P_n converges weakly to P. \quad \square.


Example If Z is Poisson with parameter \lambda then E\left[e^{itZ}\right]=e^{\lambda(e^{i t}-1)}. Take Z_n Poisson with \lambda=n and set Y_n=\frac{Z_n-n}{\sqrt{n}}: \begin{aligned} E[e^{itY_n}]&=&E\left[e^{i \frac{t}{\sqrt{n}}(Z_n-n)}\right] = e^{-i t\sqrt{n}} E\left[ e^{i \frac{t}{\sqrt{n}}Z_n}\right] =e^{-i t\sqrt{n}} e^{n\left(e^{i\frac{t}{\sqrt{n}}} -1 \right)}\\ &=& e^{-i t\sqrt{n}} e^{n\left( i\frac{t}{\sqrt{n}} - \frac{t^2}{2n} + O(n^{-3/2})\right) } = e^{-\frac{t^2}{2}} e^{ O(n^{-1/2})} \end{aligned} So Y_n converges weakly to a standard normal.
This is exactly the kind of computation that we will use to prove the central limit theorem in the next section.
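A quick simulation check of this example (a sketch; names and parameter choices ours), comparing the empirical CDF of Y_n with the standard normal CDF:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
z = rng.poisson(lam=n, size=100_000)   # samples of Z_n, Poisson with lambda = n
y = (z - n) / np.sqrt(n)               # Y_n = (Z_n - n) / sqrt(n)

def phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

for t in [-1.0, 0.0, 1.0]:
    print(f"t={t}: P(Y_n <= t) = {np.mean(y <= t):.4f}, Phi(t) = {phi(t):.4f}")
```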

13.10 Exercise

Exercise 13.1 (Continuity Theorem for convergence in distribution) Show that if X_n converges in distribution to X and f is a continuous function then f(X_n) converges in distribution to f(X).

Exercise 13.2 (Convergence of distributions for discrete random variables) Suppose X_n and X take values in \mathbb{Z}. Show that X_n converges to X in distribution if and only if P(X_n=j) converges to P(X=j) for all j \in \mathbb{Z}.
Hint: For the “if” direction pick a finite set \Lambda \subset \mathbb{Z} such that \sum_{j \in \Lambda} P(X=j) \ge 1 -\epsilon.

Exercise 13.3 (Criterion for tightness) Suppose \phi is a non-negative function with \lim_{|x|\to \infty} \phi(x) = + \infty. Show that if C=\sup_{n}E[\phi(X_n)]< \infty then the sequence of random variables X_n is tight (that is, the family of distributions P^{X_n} is tight).

Exercise 13.4  

  1. Show that if X_n converges to X in distribution and Y_n converges to Y in distribution and X_n and Y_n are independent for all n then X_n+Y_n converges to X+Y in distribution. Hint: Use the characteristic function.

  2. Show with a counterexample that the assumption that X_n and Y_n are independent cannot, in general, be dropped in part 1.

Exercise 13.5 Given independent identically distributed random variables X_1, X_2, \cdots, X_n with a common distribution function F(x) = P( X_j \le x) let M_n = \max_{1\le k \le n} X_k be the maximum.

  1. Assume that for any finite x we have F(x) < 1 (this means that the X_j are unbounded). Show that \lim_{n \to \infty} M_n = + \infty \text{ ~almost surely}. Hint: Fix an arbitrary R and consider the events A_n = \{X_n > R\}. Then apply the Borel-Cantelli Lemma.

  2. Assume that we have F(x_0) =1 and F(x)<1 if x < x_0 (this means that X_j are bounded). Show that \lim_{n \to \infty} M_n = x_0 \text{ ~almost surely}. Hint: Argue as in 1.

  3. Suppose that the X_j are exponential random variables with distribution function F(x) = 1 - e^{-x}. From part 1. we know that M_n diverges almost surely. In order to characterize this divergence show that
    \lim_{n \to \infty} P(M_n - \log n \le x) = e^{ - e^{-x}} The random variable Z with distribution function P( Z \le x) = e^{ - e^{-x}} is called a Gumbel distribution.

14 Central limit theorem

The LLN asserts that for IID random variables the empirical mean \frac{S_n}{n} converges to the mean \mu=E[X_i]. The central limit theorem describes the small fluctuations around the mean. Informally it says that if the X_i are independent and identically distributed and have finite variance then \frac{S_n}{n} \approx \mu + \frac{\sigma}{\sqrt{n}} Z \quad \textrm {as } n \to \infty
where Z is a standard normal random variable.

14.1 Empirical finding

We take X_k to be uniform on \{-40, -39,\cdots, 40\} with mean \mu= 0 and variance \sigma^2=\frac{81^2 -1}{12}. For various values of n we generate m IID samples of \frac{S_n}{n} and then rescale them by \sqrt{n} to obtain a variance which is independent of n. We then plot a histogram of the values obtained, comparing with the pdf of a normal distribution with mean 0 and variance \sigma^2.

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# number of samples in the sample mean
num = [1, 10, 100, 1000] 
# list of sample means
means = [] 

# number of realizations of the sample means

num_re = 10000

# Generating num random numbers from -40 to 40
# taking their mean and appending it to list means.
for j in num:
    x = [np.mean(
        np.random.randint(
            -40, 41, j)) for _i in range(num_re)]
    means.append(x)

k = 0
xrange = np.arange(-100,100,.1)

# plotting all the rescaled means in one figure
fig, ax = plt.subplots(2, 2, figsize =(8, 8))
for i in range(0, 2):
    for j in range(0, 2):
        # Histogram for each x stored in means
        ax[i, j].hist(np.multiply(np.sqrt(num[k]),means[k]), 100, density = True)
        ax[i,j].plot(xrange,norm.pdf(xrange, loc=0, scale=np.sqrt((81**2-1)/12)),label="pdf of normal with variance $\sigma$^2")
        ax[i, j].set_title(label=f"n={num[k]}")
        k = k + 1
plt.show()

14.2 The central limit theorem

Theorem 14.1 (Central Limit theorem) Suppose the random variables X_i are IID RVs with E[X_i]=\mu and \rm{Var}(X_i)=\sigma^2 for all i. Then Y_n=\frac{S_n-n\mu}{\sqrt{n} \sigma} converges in distribution to a standard normal random variable Z.

  • To understand and remember the scaling, note that E[Y_n]= 0 \quad \quad {\rm Var}(Y_n)= \frac{1}{n \sigma^2} {\rm Var}(S_n) = 1

  • Often, using one version of the convergence in distribution we write that for any a,b we have \lim_{n \to \infty} P\left( a \le \frac{X_1+ \cdots + X_n - n \mu}{\sqrt{n} \sigma} \le b \right) = \int_a^b \frac{e^{-x^2/2}}{\sqrt{2 \pi}} dx

Proof. We have done most of the work already! By Theorem 13.6 it is enough to prove that the characteristic function E[e^{it Y_n}] converges to E[e^{itZ}]=e^{-t^2/2} for all t. We denote by \phi the characteristic function of the random variables \frac{X_i-\mu}{\sigma}. We have, using independence, \begin{aligned} \phi_{Y_n}(t)=E\left[e^{it Y_n}\right] = E\left[e^{i \frac{t}{\sqrt{n} \sigma} \sum_{k=1}^n (X_k-\mu) } \right] = \prod_{k=1}^n E\left[e^{i \frac{t}{\sqrt{n}} \frac{X_k-\mu}{\sigma} } \right] = \phi\left(\frac{t}{\sqrt{n}}\right)^n \end{aligned} Since X_i has finite variance, by Theorem 10.2 \phi(t) is twice continuously differentiable and we have \phi'(t)=i E\left[ \left(\frac{X_k-\mu}{\sigma}\right) e^{it \frac{X_k-\mu}{\sigma}}\right] \quad \quad \phi''(t)=- E\left[\left(\frac{X_k-\mu}{\sigma}\right)^2 e^{it\frac{X_k-\mu}{\sigma}}\right] and so \phi'(0)=0 and \phi''(0)=-1. A Taylor expansion around 0 gives \phi(t)= 1 - \frac{t^2}{2} + t^2 h(t) = 1- \frac{t^2}{2}(1 -2h(t)) \quad \text{ with } \lim_{t \to 0} h(t) =0 We have then \phi_{Y_n}(t) = \phi\left(\frac{t}{\sqrt{n}}\right)^n = \left(1 - \frac{t^2(1 - 2h(t/\sqrt{n}))}{2n} \right)^n \to e^{-t^2/2} where we have used that if c_n \to c then (1+ c_n/n)^n \to e^c (by L’Hopital’s rule). \quad \square

14.3 Variations on the CLT

Modifying the proof slightly one can find

Theorem 14.2 Let X_i be independent random variables with E[X_i]=0 for all i and variance \sigma^2_{i}={\rm Var}(X_i). Assume \sup_{i}\sigma^2_i<\infty and \sum_{i} \sigma^2_i = \infty. Then \frac{S_n}{\sqrt{ \sum_{j=1}^n \sigma_j^2}} \to Z \quad \textrm{ in distribution} where Z is a standard normal.

and also there is a multidimensional version

Theorem 14.3 (multi-dimensional central limit theorem) Let X_i be IID \mathbb{R}^d-valued random variables. Let \mu=E[X_i] be the vector of means and let Q be the covariance matrix Q={\rm Cov}(X_i, X_i). Then \frac{S_n - n \mu}{\sqrt{n}} \to Z \quad \textrm{ in distribution} where Z is Gaussian with mean vector 0 and covariance matrix Q.

14.4 Confidence intervals (version 1)

  • We build confidence intervals for \frac{S_n}{n}, since by the central limit theorem \frac{\sqrt{n}}{\sigma} \left( \frac{S_n}{n} - \mu\right) is asymptotically normal.
  • To build an \alpha-confidence interval we let z_\alpha be the number defined by

\alpha = \frac{1}{\sqrt{2 \pi}} \int_{-z_\alpha }^{z_\alpha} e^{-\frac{x^2}{2}} \, dx \quad \quad \textrm{ for example } \quad \left\{\begin{array}{c} z_{.90}=1.645... \,\,(90\% \textrm{ confidence interval}) \\ z_{.95}=1.960... \,\,(95\% \textrm{ confidence interval}) \\ z_{.99}=2.576... \,\,(99\% \textrm{ confidence interval}) \end{array} \right.

  • By the CLT \displaystyle P\left( \mu \in \left[ \frac{S_n}{n} - z_\alpha \frac{\sigma}{\sqrt{n}}\,,\, \frac{S_n}{n} + z_\alpha \frac{\sigma}{\sqrt{n}} \right] \right) \to \alpha \quad \textrm{ as } n \to \infty.

Approximate \alpha Confidence Interval P \left( \mu \in \left[\frac{S_n}{n} - \epsilon, \frac{S_n}{n} + \epsilon\right] \right) \gtrapprox \alpha \quad \textrm{ provided } \quad n \ge z_\alpha^2 \frac{\sigma^2}{ \epsilon^2}
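The quantiles z_\alpha can be computed with scipy (already used above), since z_\alpha = \Phi^{-1}\left(\frac{1+\alpha}{2}\right) where \Phi is the standard normal CDF. A sketch, with made-up values for \sigma and \epsilon:

```python
from scipy.stats import norm

# z_alpha solves alpha = P(-z_alpha <= Z <= z_alpha), i.e. z_alpha = Phi^{-1}((1+alpha)/2)
for alpha in (0.90, 0.95, 0.99):
    print(f"z_{alpha} = {norm.ppf((1 + alpha) / 2):.3f}")

# sample size needed so that the 95% interval has half-width eps
sigma, eps = 2.0, 0.1                  # hypothetical values
n_needed = (norm.ppf(0.975) * sigma / eps) ** 2   # n >= z_alpha^2 sigma^2 / eps^2
print(f"need n >= {n_needed:.0f}")
```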

  • The issue with this formula is that we are trying to compute \mu, so there is no reason to believe that the variance \sigma^2 is known! To remedy this issue we will later use the estimator for \sigma^2 built from our samples X_1, X_2, \cdots.

14.5 Applications of the CLT to Monte-Carlo method

In the spirit of the Monte-Carlo method the CLT provides a way to compare different Monte-Carlo methods to compute a number \mu. The idea is simple: given two Monte-Carlo estimators to compute \mu \frac{1}{n}\sum_{k=1}^n X_k \to \mu \quad \text{ and } \quad \frac{1}{n}\sum_{k=1}^n Y_k \to \mu choose the one with the smaller variance since by the central limit theorem the estimator with smaller variance will be more concentrated around \mu.

Example: comparing estimators to compute integrals Given a function h (without loss of generality with 0\le h\le 1 and defined on [0,1]) we have the estimators for \mu=\int_0^1 h(x) dx \frac{1}{n}\sum_{k=1}^n h(U_k) \to \int_0^1 h(x) dx \quad \text{ where } U_k \text{ are uniform on } [0,1] and \frac{1}{n}\sum_{k=1}^n X_k \to \int_0^1 h(x) dx \quad \text{ where } X_k = \left\{ \begin{array}{cl} 1 & \text{if } U_k \le h(V_k) \\ 0 & \text{if } U_k > h(V_k) \end{array} \right. \quad \text{ where } U_k, V_k \text{ are uniform on } [0,1]

Computing the variances we find {\rm Var}(h(U)) = \int_0^1 h(x)^2 dx - \left(\int_0^1 h(x) dx \right)^2 and {\rm Var}(X)= \mu(1-\mu) = \int_0^1 h(x) dx - \left(\int_0^1 h(x) dx \right)^2 and since 0\le h(x) \le 1 we have h^2(x) \le h(x) and thus {\rm Var}(h(U)) \le {\rm Var}(X).
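A quick numerical check of this comparison (a sketch; h(x)=x^2 is an arbitrary test integrand with \mu=1/3):

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: x**2                     # test integrand on [0,1], mu = 1/3
n = 100_000

# estimator 1: average h over uniform samples
u1 = rng.random(n)
est1 = h(u1).mean()

# estimator 2: hit-or-miss, fraction of points under the graph of h
u2, v2 = rng.random(n), rng.random(n)
est2 = (u2 <= h(v2)).mean()

print(est1, est2)                      # both near 1/3
# empirical variances: Var(h(U)) = 4/45 < mu(1-mu) = 2/9
print(h(u1).var(), (u2 <= h(v2)).astype(float).var())
```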

Importance sampling: Suppose we are trying to compute with a Monte-Carlo method (using a RV X with density f_X(x)) the integral E[h(X)]= \int h(x) f_X(x) dx.

Suppose for example that h(x)=1_{\{x \ge 4\}} and X is standard normal. Then E[h(X)]=P(X \ge 4)\approx 0.00003 which is tiny. For a meaningful estimate of \mu, since the CLT gives S_n/n \approx \mu + \frac{\sigma}{\sqrt{n}} Z, we must have \frac{\sigma}{\sqrt{n}} \ll \mu, that is n \gg \frac{\sigma^2}{\mu^2}.

The naive estimator using the Bernoulli RV Y=1_{\{X \ge 4\}} has variance {\rm Var}(Y)= P(X \ge 4)(1- P(X\ge 4)) \approx P(X \ge 4)=\mu and so we need n \gg \mu^{-1} samples.

The idea behind importance sampling is that in the previous estimator most samples are “lost”: indeed most samples give X<4 where h(x)=0. Instead we should change the sampling distribution so that most samples are greater than 4. The general principle is to use another density g_Y(y) for another random variable Y and write E[h(X)]= \int h(x) f_X(x) dx = \int \frac{h(x) f_X(x)}{g_Y(x)} g_Y(x) dx = E\left[\frac{h(Y)f_X(Y)}{g_Y(Y)} \right] which gives us another estimator whose variance is E\left[\left(\frac{h(Y)f_X(Y)}{g_Y(Y)}\right)^2 \right] - E\left[\frac{h(Y)f_X(Y)}{g_Y(Y)} \right]^2 = E\left[\left(\frac{h(Y)f_X(Y)}{g_Y(Y)}\right)^2 \right] - E\left[h(X)\right]^2 The potential gain is in the first term. For the example at hand we pick Y to be a shifted exponential with pdf g_Y(x)=e^{-(x-4)} for x \ge 4 which ensures that all samples exceed 4 and thus every sample contributes. To see if we gain something let us estimate \begin{aligned} E\left[\left(\frac{h(Y)f_X(Y)}{g_Y(Y)}\right)^2 \right] &= \int h(x)^2 \frac{f^2_X(x)}{g_Y(x)} dx = \int_4^\infty \frac{1}{2\pi} e^{-x^2/2} e^{ - x^2/2 +x -4} dx \\ &= \int_4^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \frac{1}{\sqrt{2\pi}} e^{-(x-1)^2/2} e^{-7/2} dx \le \frac{e^{-7/2} e^{-9/2}}{\sqrt{2\pi}} \int_4^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx = \frac{e^{-8}}{\sqrt{2 \pi}} \mu \end{aligned} so the variance shrinks by a factor of about \frac{e^{-8}}{\sqrt{2 \pi}}=0.000134... Impressive!
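The calculation above can be tried out numerically. The following sketch compares the naive estimator and the importance sampling estimator of P(X \ge 4) (sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 100_000
mu_true = 1 - norm.cdf(4)              # P(X >= 4) for X standard normal, ~3.17e-5

# naive estimator: fraction of standard normal samples that exceed 4
naive = np.mean(rng.standard_normal(n) >= 4)

# importance sampling: Y = 4 + Exp(1) has density g_Y(x) = e^{-(x-4)} on [4, infinity)
y = 4 + rng.exponential(size=n)
weights = norm.pdf(y) / np.exp(-(y - 4))   # f_X(Y) / g_Y(Y); h(Y) = 1 on the support
is_est = weights.mean()

print(naive, is_est, mu_true)          # the IS estimate is far more accurate
```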

14.6 Slutsky theorem and applications

Slutsky is a very useful theorem with many applications. First we need a technical result (useful in its own right) which tells us that we only need to consider Lipschitz bounded functions to check for convergence in distribution.

Recall g is Lipschitz continuous if there exists a constant k such that |g(x)-g(y)| \le k \|x-y\| for all x,y. Lipschitz functions are uniformly continuous, and functions which are differentiable with a bounded derivative \sup_{x}|g'(x)| < \infty are Lipschitz continuous.

Theorem 14.4 (Portmanteau Theorem) The sequence X_n converges to X in distribution if and only if \lim_{n \to \infty} E[f(X_n)] = E[f(X)] for all functions f which are bounded and Lipschitz continuous.

Proof. Suppose f is bounded with \alpha=\sup_{x}|f(x)| so that -\alpha \le f(x) \le \alpha. Then consider the functions h_k(x) = \inf_{y}\{f(y) + k\|x-y\| \} \quad \text{ and }\quad H_k(x) = \sup_{y}\{f(y) - k\|x-y\| \}

  • The functions h_k and H_k are bounded and increasing/decreasing sequences. We have \begin{aligned} -\alpha \le \inf_{y} f(y) \le \inf_{y} \{f(y) + k\|x-y\| \} = h_k(x) \le f(x) \le \sup_{y}\{f(y) - k\|x-y\| \}= H_k(x) \le \sup_{y} f(y) \le \alpha \end{aligned} and therefore we have -\alpha \le h_k \le h_{k+1} \le f(x) \le H_{k+1} \le H_{k} \le \alpha.

  • The functions h_k and H_k are Lipschitz continuous. Indeed we have \|x_1-y\| \le \|x_2-y\|+\|x_1-x_2\| and thus h_k(x_1) = \inf_{y} \{f(y)+ k\|x_1-y\| \} \le \inf_{y} \{f(y)+ k\|x_2-y\| \} + k\|x_1-x_2\| = h_k(x_2) + k\|x_1-x_2\| Interchanging x_1 and x_2 shows that h_k is Lipschitz with constant k. The same holds for H_k(x).

  • We show next that \lim_{k \to \infty} h_k(x) = f(x) and \lim_{k \to \infty}H_k(x) = f(x).
    Given \epsilon>0 pick \delta such that \|x-y\|\le \delta implies |f(x)-f(y)|\le \epsilon, and pick k_0 such that f(x) - \inf_{z} f(z) \le \delta k_0. Then f(y) + k_0\|x-y\| \ge \left\{ \begin{array}{cl} f(y) \ge f(x)-\epsilon &\text{ if } \|x-y\| \le \delta \\ \inf_{z} f(z) + k_0 \delta \ge f(x) & \text{ if } \|x-y\| \ge \delta \end{array} \right. and so \lim_{k} h_k(x) \ge f(x)-\epsilon which proves that \lim_{k} h_k(x) = f(x).

  • To conclude take f bounded and continuous and assume that for all Lipschitz and bounded functions h we have \lim_n E[h(X_n)]=E[h(X)]. We have that, for any k, since f \ge h_k, \liminf_n E[f(X_n)] \ge \liminf_n E[h_k(X_n)] = E[h_k(X)] and by the monotone convergence theorem \lim_{k} E[h_k(X)] = E[f(X)] and thus E[f(X)] \le \liminf_n E[f(X_n)]. In a similar manner, using H_k, one proves that E[f(X)] \ge \limsup_n E[f(X_n)] and thus E[f(X)] =\lim_n E[f(X_n)]. \quad \square

Theorem 14.5 (Slutsky’s Theorem) If X_n converges to X in distribution and |Y_n-X_n| converges to 0 in probability then Y_n converges to X in distribution.

Proof. Using Theorem 14.4, it suffices to consider a bounded Lipschitz function f with \sup_{x}|f(x)|\le M and |f(x)-f(y)|\le K|x-y| for all x,y.

For any \epsilon>0 we have \begin{aligned} |E[f(X_n)]-E[f(Y_n)]| &\le E[ |f(X_n)-f(Y_n)| 1_{\{|X_n-Y_n|< \epsilon\}} ] + E[ |f(X_n)-f(Y_n)| 1_{\{|X_n-Y_n|\ge \epsilon\}} ] \\ & \le K \epsilon + 2M P(|X_n-Y_n|\ge \epsilon) \end{aligned} Consequently \begin{aligned} |E[f(Y_n)]-E[f(X)]| & \le |E[f(Y_n)]-E[f(X_n)]| + |E[f(X_n)]-E[f(X)]| \\ &\le K \epsilon + 2M P(|X_n-Y_n|\ge \epsilon) + |E[f(X_n)]-E[f(X)]| \end{aligned} As n \to \infty the right hand side converges to K\epsilon: the second term goes to 0 since |Y_n-X_n| converges to 0 in probability, and |E[f(X_n)]-E[f(X)]| converges to 0 since X_n converges to X in distribution. Since \epsilon is arbitrary this concludes the proof. \quad \square

In many applications the following result, also called Slutsky Theorem, is very useful.

Theorem 14.6 (Slutsky’s Theorem) Suppose X_n converges to X in distribution and Y_n converges to c in probability. Then

  1. X_n+Y_n \to X+c in distribution.
  2. X_n Y_n \to cX in distribution.
  3. X_n /Y_n \to X/c in distribution (provided c \not =0)

Proof.

  • Consider the random variables (X_n,c). We show that (X_n,c) converges to (X,c) in distribution. Indeed for any bounded continuous function f(x,y) consider the function g(x)=f(x,c). Since X_n converges to X in distribution then E[g(X_n)]=E[f(X_n,c)] converges to E[g(X)]=E[f(X,c)].

  • Now |(X_n,Y_n)-(X_n,c)|=|Y_n-c|. So if Y_n converges to c in probability then (X_n,Y_n) converges to (X_n,c) in probability.

  • Using Theorem 14.5 we conclude that (X_n,Y_n) converges to (X,c) in distribution.

  • We can now use the continuity theorem for convergence in distribution (see Exercise 13.1) with the continuous functions h(X_n,Y_n)= X_n+Y_n or X_nY_n or X_n/Y_n.

\quad \square

The central limit theorem states that \frac{S_n-n\mu}{\sqrt{n}\sigma} converges to a standard normal Z. If \sigma is not known, can we replace it by the variance estimator V_n = \frac{1}{n}\sum_k(X_k-\frac{S_n}{n})^2? The answer is yes, by applying Slutsky’s theorem.

Theorem 14.7 (CLT using the empirical variance) Suppose the random variables X_i are IID RVs with E[X_i]=\mu and \rm{Var}(X_i)=\sigma^2 for all i. Then Y_n=\frac{S_n-n\mu}{\sqrt{n V_n}} converges in distribution to a standard normal random variable Z.

Proof. This follows from Theorem 14.6 since \sqrt{V_n} converges to \sigma in probability by the law of large numbers (and the continuity theorem) and \frac{S_n-n\mu}{\sqrt{n}\sigma} converges to Z in distribution. \quad \square


This is the standard way the CLT is used in statistical applications.
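For instance, a confidence interval can be computed entirely from data (a sketch, with exponential samples whose true mean is 1; the sample size is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.exponential(size=10_000)       # IID samples with (unknown) mean 1

n = len(x)
mean = x.mean()
s = x.std()                            # sqrt(V_n), the empirical standard deviation
z = norm.ppf(0.975)                    # z_.95

half_width = z * s / np.sqrt(n)
print(f"95% CI: [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```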

14.7 The \delta-method

Another nice application is the so-called \delta-method which is a kind of nonlinear version of the CLT. To this end it is convenient to rewrite the CLT as \sqrt{n} \left(\frac{S_n}{n} - \mu\right) \to Y \quad \text{ in distribution } where Y is normal with mean 0 and variance \sigma^2. By the continuity theorem if g is a continuous function then g\left(\frac{S_n}{n}\right) converges to g(\mu) almost surely and it is natural to ask whether we have a central limit theorem. The answer is yes provided g is differentiable, and it is given by the following theorem. We only show the 1d version.

Theorem 14.8 (\delta-method) Suppose Y is normal with mean 0 and variance \sigma^2 and we have \sqrt{n}\left(\frac{S_n}{n} - \mu\right) \to Y \quad \text{ in distribution.} Assume g:\mathbb{R} \to \mathbb{R} is continuously differentiable with g'(\mu)\not =0 then \sqrt{n} \left(g\left(\frac{S_n}{n}\right) - g(\mu)\right) \to Y' \quad \text{ in distribution,} where Y' is normal with mean 0 and variance \sigma^2 g'(\mu)^2.

Proof. A Taylor expansion around \mu gives
g(\mu+h)= g(\mu) + g'(\mu)h + h\, r(h) \quad \text{ with } \lim_{h \to 0 } r(h) = 0 Applying this to h=\frac{S_n}{n}-\mu we find \sqrt{n}\left( g\left(\frac{S_n}{n}\right) - g(\mu)\right) = \sqrt{n} g'(\mu) \left(\frac{S_n}{n}-\mu\right) + \sqrt{n} \left(\frac{S_n}{n}-\mu\right) r\left(\frac{S_n}{n}-\mu\right) \tag{14.1} Here comes Slutsky’s Theorem in action. On one hand \frac{S_n}{n}-\mu converges to 0 in probability and since r is continuous at 0 then r\left(\frac{S_n}{n}-\mu\right) converges to 0 in probability as well by Theorem 11.4. Since \sqrt{n} \left(\frac{S_n}{n}-\mu\right) converges to Y in distribution, Theorem 14.6 implies that the last term converges to 0 in distribution. But as we have seen in Theorem 13.1 convergence to a constant in distribution implies convergence in probability. We can now apply Theorem 14.5: the first term on the right hand side of Equation 14.1 converges in distribution to a normal with variance \sigma^2|g'(\mu)|^2 and the second term converges in probability to 0, therefore the left hand side of Equation 14.1 converges in distribution to a normal with variance \sigma^2|g'(\mu)|^2. \quad \square

Example
Suppose we are interested in the distribution of Y_n = e^{S_n/n} = \left(\prod_{k=1}^n e^{X_k}\right)^{1/n}, a type of model used in financial applications. With g(x)=e^x we have g'(\mu)=e^\mu and the delta method gives \sqrt{n}\left( e^{\frac{S_n}{n}} - e^\mu \right) \to Y' \quad \text{ in distribution} where Y' is normal with mean 0 and variance \sigma^2 e^{2\mu}.
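A simulation sketch of this example (taking X_k uniform on [0,1], so \mu=1/2 and \sigma^2=1/12; the sizes n and m are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 10_000
mu, sigma2 = 0.5, 1/12                 # mean and variance of uniform on [0,1]

# m copies of sqrt(n) * (exp(S_n/n) - exp(mu))
sample_means = rng.random((m, n)).mean(axis=1)
y = np.sqrt(n) * (np.exp(sample_means) - np.exp(mu))

# the delta method predicts mean 0 and variance sigma^2 * e^{2 mu}
print(y.mean(), y.var(), sigma2 * np.exp(2 * mu))
```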

Example
Suppose we have Bernoulli random variables X_1, X_2, \cdots, X_n and we are interested in the odds of success, that is the ratio \frac{p}{1-p}. (In gambling, the odds of success are often given instead of the probability of success.) An estimator for the odds is given by Y_n= \frac{\frac{S_n}{n}}{1-\frac{S_n}{n}} so we can apply the \delta-method with g(x)=\frac{x}{1-x} and g'(x)=\frac{1}{(1-x)^2}. The delta method tells us that \sqrt{n}\left(Y_n - \frac{p}{1-p}\right) \to Y \quad \text{ in distribution} where Y is normal with variance p(1-p)g'(p)^2=\frac{p}{(1-p)^3}.