The axioms of probability were first formalized by Andrey Kolmogorov in 1933, but probability theory started much earlier.
The Doctrine of Chances by Abraham de Moivre in 1718 is usually considered the first probability textbook, and de Moivre also first proved a version of the central limit theorem in 1733.
The Ars Conjectandi by Jacob Bernoulli in 1713 contains the first proof of the Law of Large Numbers, and the subsequent work by Pierre Simon Laplace, Théorie analytique des probabilités in 1812, presents a formalization of probability theory.
The mathematical foundations of Probability rely, at the very bottom, on measure theory, developed earlier by the likes of Émile Borel, Henri Lebesgue, and many others.
According to a remark attributed to Mark Kac: probability theory is measure theory with a soul.
Probability is the new kid on the mathematics block (after Geometry, Algebra and Analysis) but of course the coolest kid.
Probability theory provides the mathematical foundations and the language for Statistics, Ergodic Theory, Statistical Mechanics, Information Theory, and (a big part of) Machine Learning.
1.1 \sigma-algebras
The state space or sample space \Omega is some abstract space and we denote subsets of \Omega by capital letters A,B,\cdots. We use the language of set theory
A^c= \Omega\setminus A, \quad A \cap B, \quad A \cup B, \quad A \setminus B, \quad A \supset B, \quad \emptyset,
Definition 1.1 (partition) A partition of \Omega is a collection of sets, (A_i)_{i=1}^\infty, such that
A_i \cap A_j =\emptyset (pairwise disjoint) for all i \not=j.
\bigcup_{i=1}^\infty A_i = \Omega
Intuition:
\Omega = collection of all possible outcomes of an experiment
A = event = collection of all outcomes compatible with the event A
Write 2^\Omega for the collection of all subsets of \Omega; we denote by calligraphic letters \mathcal{A}, \mathcal{E}, \mathcal{F}, ... collections of subsets of \Omega (that is, subsets of 2^\Omega).
We introduce natural collections of subsets.
Definition 1.2 (algebra) A collection of sets \mathcal{E} is an algebra if
\emptyset \in \mathcal{E} and \Omega \in \mathcal{E}.
\mathcal{E} is closed under complement: A \in \mathcal{E} \implies A^c \in \mathcal{E}.
\mathcal{E} is closed under finite union and intersection: A_1, \cdots, A_n \in \mathcal{E} \implies \begin{array}{c}\bigcup_{i=1}^n A_i \in \mathcal{E} \\ \bigcap_{i=1}^n A_i \in \mathcal{E}\end{array}.
Definition 1.3 (\sigma-algebra) A collection of sets \mathcal{A} is a \sigma-algebra if
\emptyset \in \mathcal{A} and \Omega \in \mathcal{A}.
\mathcal{A} is closed under complement: A \in \mathcal{A} \implies A^c \in \mathcal{A}.
\mathcal{A} is closed under countable union and intersection: A_1, A_2, \cdots, \in \mathcal{A} \implies \begin{array}{c}\bigcup_{i=1}^\infty A_i \in \mathcal{A} \\ \bigcap_{i=1}^\infty A_i \in \mathcal{A}\end{array}.
Examples of \sigma-algebras: \mathcal{A} = \left\{ \emptyset, \Omega \right \}, \mathcal{A}=2^\Omega, \mathcal{A} = \left\{ \emptyset, A, A^c, \Omega \right \}, and the \sigma-algebra generated by two sets A and B, which contains among others \emptyset, A, B, A^c, B^c, A\cup B, A \cap B, A\setminus B, B \setminus A, A^c \cap B^c, \Omega.
Definition 1.4 (\sigma-algebra generated by \mathcal{C}) If \mathcal{C} is a collection of subsets of \Omega, the \sigma-algebra generated by \mathcal{C}, denoted by \sigma(\mathcal{C}), is the smallest \sigma-algebra containing \mathcal{C}.
Remark: \sigma(\mathcal{C}) always exists since, alternatively, you can think of it as the intersection of all \sigma-algebras which contain \mathcal{C}. Indeed we always have \mathcal{C} \subset 2^\Omega, and arbitrary intersections of \sigma-algebras are \sigma-algebras (see Homework for more details).
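For a finite \Omega the generated \sigma-algebra can be computed by brute force: start from \mathcal{C} and repeatedly close under complements and unions until nothing new appears. A minimal sketch in Python (the function name and the example collection are ours):

```python
from itertools import combinations

def generate_sigma_algebra(omega, collection):
    """Smallest sigma-algebra on the finite set omega containing collection."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        # close under complements
        for a in current:
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        # close under pairwise unions (Omega is finite, so this suffices)
        for a, b in combinations(current, 2):
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets

omega = {1, 2, 3, 4}
sigma_AB = generate_sigma_algebra(omega, [{1}, {1, 2}])
# the atoms are {1}, {2}, {3,4}, so sigma({1},{1,2}) has 2^3 = 8 elements
print(len(sigma_AB))
```

Since every set produced by the closure is built from the generators by complements and unions, the fixed point is exactly \sigma(\mathcal{C}) when \Omega is finite.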
If the space \Omega has a topology (e.g. \Omega = \mathbb{R}) then it is natural to pick a \sigma-algebra compatible with the open sets.
Definition 1.5 (The Borel \sigma-algebra) If \Omega=\mathbb{R} the Borel \sigma-algebra \mathcal{B} is the \sigma-algebra generated by the collection of all open sets (or, equivalently, by the closed sets).
The sets in the Borel \sigma-algebra \mathcal{B} are called Borel sets.
Remark: The Borel \sigma-algebra of \mathbb{R} is strictly smaller than 2^\mathbb{R} (see Math 623 for a proof (existence of non-measurable sets)). The \sigma-algebra 2^\mathbb{R} is too big to be useful!
An important characterization of the Borel \sigma-algebra of \mathbb{R} is the following. It will play a big role in characterizing random variables.
Theorem 1.1 The Borel \sigma-algebra of \mathbb{R} is generated by the countable collection of intervals of the form
(-\infty, a] \quad \text{where} \quad a \in \mathbb{Q}
Proof. Let \mathcal{C} be the collection of all open sets.
If A is an open set, then A can be written as a countable union of disjoint open intervals A = \cup_{i}(a_i, b_i).
An interval (a,b) can be written as (a,b) = (-\infty,b) \setminus (-\infty,a].
Taking a sequence b_n \in \mathbb{Q} with b_n \nearrow b we have (-\infty,b)= \bigcup_{n}(-\infty, b_n].
Taking a sequence a_n \in \mathbb{Q} with a_n \searrow a we have (-\infty,a]=\bigcap_{n}(-\infty,a_n].
If \mathcal{D} is the collection of intervals (-\infty,a] with a \in \mathbb{Q}, then we have shown that \mathcal{C} \subset \sigma(\mathcal{D}) and therefore \sigma(\mathcal{C}) \subset \sigma(\mathcal{D}). But since \mathcal{D} consists of closed sets (which are Borel sets) we have \mathcal{D} \subset \sigma(\mathcal{C}) and thus \sigma(\mathcal{D}) \subset \sigma(\mathcal{C}). \quad \square
1.2 Axioms of Probability
Two simple axioms underlie probability theory.
Definition 1.6 (Probability measure) A probability measure defined on a \sigma-algebra \mathcal{A} is a function
P : \mathcal{A} \to [0,1]
with the following properties
P(\emptyset)=0 and P(\Omega)=1
(Countable additivity) For any countable collection of pairwise disjoint sets A_i (i.e. A_i \cap A_j = \emptyset for all i\not=j ) we have
P\left( \bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)
A probability space (\Omega, \mathcal{A}, P) consists of a set \Omega, a \sigma-algebra \mathcal{A}, and a probability measure P.
Remark: As we shall see, requiring finite additivity P\left( \bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) for finite n would not be sufficient.
Theorem 1.3 (Sequential continuity) Suppose P: \mathcal{A} \to [0,1] is finitely additive with P(\Omega)=1. Then P is countably additive if and only if P is sequentially continuous, that is, A_n \nearrow A or A_n \searrow A implies \lim_{n\to\infty} P(A_n)=P(A).
Proof. Let us assume first countable additivity. If A_1 \subset A_2 \subset A_3 \subset \cdots define the disjoint sets B_1=A_1, B_2=A_2\setminus A_1, B_3=A_3 \setminus A_2, \cdots. Then for any N, B_1 \cup \cdots \cup B_N=A_N and \bigcup_{n=1}^\infty A_n = \bigcup_{n=1}^\infty B_n. By countable additivity
P(A)=P\left(\bigcup_{n=1}^\infty A_n \right)=P\left(\bigcup_{n=1}^\infty B_n \right)=\sum_{n=1}^\infty P(B_n) = \lim_{N \to \infty}\sum_{n=1}^N P(B_n)=\lim_{N \to \infty}P(A_N).
If A_1 \supset A_2 \supset A_3 \supset \cdots then taking complement we have A_1^c \subset A_2^c \subset A_3^c \subset \cdots and using the above and de Morgan’s law
P(A) = P\left(\bigcap_{n=1}^\infty A_n \right) = 1 - P\left(\bigcup_{n=1}^\infty A_n^c \right) = 1 - \lim_{N\to\infty} P\left(\bigcup_{n=1}^N A_n^c\right)=1 - \lim_{N\to\infty} P\left(A_N^c\right) = \lim_{N\to\infty} P(A_N)
For the converse statement suppose A_i are pairwise disjoint and set B_N = \bigcup_{n=1}^N A_n. Then B_N \nearrow B=\bigcup_{n=1}^\infty A_n. Using finite additivity and continuity we find
P\left(\bigcup_{n=1}^\infty A_n\right)=P(B)= \lim_{N \to \infty} P(B_N) = \lim_{N \to \infty} \sum_{n=1}^N P(A_n) = \sum_{n=1}^\infty P(A_n)
which proves countable additivity.
\quad\square
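Sequential continuity can be seen at work in a concrete model. In the sketch below (the model is our choice for illustration) we take \Omega=\{1,2,3,\dots\} with weights p_k=2^{-k} and the increasing events A_N=\{1,\dots,N\} \nearrow \Omega, and watch P(A_N) increase to P(\Omega)=1:

```python
# P is the probability with weights p_k = 2^{-k} on {1, 2, 3, ...}
def P(event):
    return sum(2.0 ** -k for k in event)

# increasing events A_N = {1, ..., N}, with A_N increasing to Omega
probs = [P(range(1, N + 1)) for N in (1, 2, 5, 10, 20, 50)]
print(probs)  # 0.5, 0.75, ..., increasing toward P(Omega) = 1
```

Each P(A_N) equals 1 - 2^{-N}, so the sequence increases to 1, exactly as the theorem predicts.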
1.5 More on limits of sets
limsup and liminf of sets: For an arbitrary collection of sets A_n we define the limsup and liminf by
\begin{aligned}
\limsup_{n\to \infty} A_n &= \bigcap_{n=1}^\infty \bigcup_{m \ge n} A_m = \left\{ \omega \in \Omega \,;\, \omega \in A_n \text{ for infinitely many } n \right\} \\
\liminf_{n\to \infty} A_n &= \bigcup_{n=1}^\infty \bigcap_{m \ge n} A_m = \left\{ \omega \in \Omega \,;\, \omega \in A_n \text{ for all but finitely many } n \right\}
\end{aligned}
The fact that the two definitions coincide requires some thought (see homework for more details). For example if \omega \in \bigcap_{n=1}^\infty \bigcup_{m \ge n} A_m then \omega \in \bigcup_{m \ge n} A_m for every n. Taking n=1 we find that there exists some k_1\ge 1 such that \omega \in A_{k_1}. Taking next n=k_1+1 we see that there exists k_2 >k_1 such that \omega \in A_{k_2}, and so on. Therefore \omega belongs to infinitely many A_n.
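A finite-horizon computation can make the two formulas concrete. In the sketch below (the sets are chosen by us) A_n alternates between \{0,2\} and \{1,2\}: the point 2 is in every A_n, while 0 and 1 each occur infinitely often but not eventually always; truncating at a finite horizon N already exhibits the pattern:

```python
# A_n alternates between {1, 2} (n odd) and {0, 2} (n even):
# omega = 2 is in every A_n, while 0 and 1 each appear infinitely often.
N = 100
A = [{1, 2} if n % 2 else {0, 2} for n in range(1, N + 1)]

# finite-horizon versions of limsup = cap_n cup_{m >= n} A_m
# and liminf = cup_n cap_{m >= n} A_m (tails truncated at N)
limsup = set.intersection(*[set.union(*A[n:]) for n in range(N - 1)])
liminf = set.union(*[set.intersection(*A[n:]) for n in range(N - 1)])
print(limsup, liminf)  # {0, 1, 2} and {2}
```

As expected, 0 and 1 are in the limsup (they occur infinitely often) but only 2 is in the liminf (it occurs for all but finitely many n, in fact for all n).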
Characteristic function of a set A: 1_A(\omega)= \left\{ \begin{array}{cl} 1 & \omega \in A \\ 0 & \omega \notin A \end{array} \right.
Limits of sets: We say that A_n converge to A if \lim_{n} 1_{A_n}(\omega) = 1_A(\omega) for all \omega.
Theorem 1.4 If A_n converges to A then \lim_{n\to \infty} P(A_n)=P(A).
Proof. If \lim_{n} 1_{A_n}(\omega) = 1_A(\omega) for some \omega then the sequence is either 1 (if \omega \in A) or 0 (if \omega \notin A) for all but finitely many n. Therefore if A_n converges to A this means
A= \limsup_{n}A_n = \liminf_{n} A_n.
Set B_n = \bigcap_{m\ge n} A_m and C_n =\bigcup_{m\ge n} A_m. Then we have B_n \subset A_n \subset C_n and thus, by monotonicity
P(B_n) \le P(A_n) \le P(C_n).
Since the sequence B_n is increasing to \liminf_{n} A_n, the sequence C_n is decreasing to \limsup_{n} A_n, and A= \limsup_{n}A_n = \liminf_{n} A_n, we have \lim_{n\to \infty} P(A_n)=P(A). \quad\square
1.6 Conditional Probability and Independence (see Stat 515)
The intuition behind conditional probabilities is as follows. Suppose you observe that the event B has occurred (e.g. “it rained yesterday”). How does that influence the probability of another event A (e.g. there was wind yesterday, or it is raining today)?
For \omega \in A to be a possible outcome we must have \omega \in A \cap B. So only events of the form A \cap B are relevant. Normalizing the probabilities leads to the following
Definition 1.7 Given B \in \mathcal{A} with P(B)>0, the conditional probability of A given B is defined by
P(A | B) = \frac{P(A \cap B)}{P(B)}
If the occurrence of B does not influence the probability that A occurs, that is if P(A|B)=P(A), then we will call these events independent. This means (when P(A), P(B)>0) P(A|B) = P(A) \iff P(A \cap B) = P(A) P(B) \iff P(B|A) = P(B).
Definition 1.8 (Independent events)
Two events A and B are independent if P(A \cap B) = P(A) P(B)
A collection of events (A_i)_{i \in I} (I possibly infinite) is independent if for any finite J \subset I, P(\bigcap_{i \in J}A_i) = \prod_{i \in J}P(A_i)
Theorem 1.5 (Properties of conditional probability)
Independence: If A and B are independent so are A and B^c, A^c and B, and A^c and B^c.
Probability property: If P(B)>0 then A \mapsto P(A|B) defines a probability measure on \mathcal{A}.
Product rule: P(A_1 \cap \cdots \cap A_n)= P(A_1) P(A_2|A_1) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1}).
Bayes' rule: If (E_n) is a partition of \Omega with P(E_n)>0 then P(E_m|A) = \frac{P(A|E_m) P(E_m)}{\sum_{n}P(A|E_n) P(E_n)}.
Proof.
Probability property: P(\emptyset|B) = \frac{P(\emptyset)}{P(B)}=0 and P(\Omega|B) = \frac{P(\Omega \cap B)}{P(B)}=1. For countable additivity
P (\cup_n A_n |B ) = \frac{P \left(\cup_{n} (A_n \cap B) \right) }{P(B)} =\sum_{n} \frac{ P(A_n \cap B)}{P(B)} = \sum_{n}P(A_n|B).
Product rule: Use repeatedly P(A \cap B) =P(B|A) P(A) which is just the definition. Then
P(A_1 \cap \cdots \cap A_n)= P(A_n|A_1 \cap \cdots \cap A_{n-1}) P(A_1 \cap \cdots \cap A_{n-1}) = \cdots
Bayes: Use the definition and conditioning:
P(E_m|A) = \frac{P(A\cap E_m)}{P(A)} = \frac{P(A|E_m) P(E_m)}{P(A)} = \frac{P(A|E_m) P(E_m)}{\sum_{n}P(A|E_n) P(E_n)}.
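As a numerical sanity check of Bayes' rule, here is a sketch with a two-set partition; all the numbers (a rare condition and an imperfect test) are hypothetical and chosen by us:

```python
# Partition E_1 = "has condition", E_2 = "does not"; A = "test positive".
# All numbers below are made up for illustration.
P_E = {1: 0.01, 2: 0.99}          # prior P(E_m), a partition of Omega
P_A_given_E = {1: 0.95, 2: 0.05}  # likelihood P(A | E_m)

# total probability: P(A) = sum_m P(A|E_m) P(E_m)
P_A = sum(P_A_given_E[m] * P_E[m] for m in P_E)

# Bayes: P(E_1 | A) = P(A|E_1) P(E_1) / P(A)
posterior = P_A_given_E[1] * P_E[1] / P_A
print(round(posterior, 4))  # ~0.161
```

Note how small the posterior is despite the accurate test: the prior P(E_1)=0.01 dominates, which is exactly what the denominator \sum_n P(A|E_n)P(E_n) encodes.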
1.7 Take-home messages
\sigma-algebras would not be necessary if we worked only with discrete sample spaces.
But to describe sample spaces like \mathbb{R} or a countable collection of discrete models (think coin flips) they are necessary (recall that 2^\mathbb{N} has the same cardinality as \mathbb{R}). In the dark corners of the real numbers various monsters are lurking that need to be tamed (and then ignored).
At a deeper and more interesting level, \sigma-algebras will appear later in conditional expectations, martingales and stochastic processes; they will be used as an information-theoretic tool and will describe the sets of questions or inquiries you can perform on a model.
1.8 Homework problems
Exercise 1.1
Suppose \{\mathcal{A}_j\}_{j \in J} is an arbitrary collection of \sigma-algebras (J does not need to be countable). Show that the intersection \cap_{j\in J} \mathcal{A}_j is a \sigma-algebra.
Is the union of two \sigma-algebras a \sigma-algebra? Prove or disprove.
Exercise 1.2 Suppose \mathcal{A} is a \sigma-algebra and let B \in \mathcal{A} be some set. Define \mathcal{B} = \{ A \cap B\,;\, A \in \mathcal{A} \}.
Show that \mathcal{B} is a \sigma-algebra.
Show that the conditional probability P(A|B) is a probability on the \sigma-algebra \mathcal{B}.
Exercise 1.3
Prove the properties of probability measures given in Theorem 1.2
Prove that two definitions for limsup and liminf of sets given in Section 1.5 are equivalent.
Show that \limsup_n A_n^c = (\liminf_n A_n )^c
I found online this pretty instructive illustration of limsup and liminf (slightly edited).
A tech company is closing down and firing the set \Omega of all its employees, who become beggars and have to live on the street (in this story people never die). The local church decides to start giving out free food for them every day. On the n^{th} day A_n is the subset of fired employees who show up at church to get fed. There are three categories of people:
Some of the people eventually get a new job and never show up at the church again.
Others are too proud and try to be seen around all the time, but they need to eat so they always come back eventually.
Lastly there are the people who, after trying everything else, eventually give up and start to get their food from the church each day.
Express the categories of people in 1., 2., 3. in terms of limsup and liminf. What needs to happen for \lim_n A_n to exist?
Exercise 1.5 Consider the following lottery game: there are N lottery tickets, visibly numbered 1 to N, and the numbers 1 to N are randomly assigned, hidden under the scratch pad. (Each number comes up exactly once; in other words, one is picking a random permutation of the N numbers.) If the randomly assigned number matches the visible number then the ticket wins. This lottery has the remarkable property that the probability that nobody wins is essentially independent of N.
Consider the events A_i=\{ \textrm{ticket } i \text{ is a winner}\}.
Compute P(A_i) and P(A_i \cap A_j) for i\not= j using the product rule.
Use the inclusion-exclusion formula to compute the probability that there is at least one winner.
Show that if N is even moderately big (maybe N\ge 5 or 6 will do) the probability that nobody wins is for all practical purposes independent of N and nearly equal to \frac{1}{e}.
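Nobody wins exactly when the random permutation has no fixed point (a derangement), so the claim can be checked by simulation before proving it. A quick Monte Carlo sketch (the function name and sample sizes are ours):

```python
import math
import random

def nobody_wins_probability(N, trials=100_000):
    """Estimate the probability that a random permutation of N tickets has no fixed point."""
    count = 0
    for _ in range(trials):
        perm = list(range(N))
        random.shuffle(perm)
        # nobody wins iff no ticket i receives its own number
        if all(perm[i] != i for i in range(N)):
            count += 1
    return count / trials

random.seed(0)
for N in (5, 6, 10):
    print(N, nobody_wins_probability(N), 1 / math.e)
```

Already for N=5 or 6 the estimates cluster around 1/e \approx 0.3679, up to Monte Carlo noise.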
Exercise 1.6 (Polya’s urn) An urn contains b blue balls and r red balls. A ball is drawn from the urn, its color is noted, and it is returned to the urn together with d balls of the same color. This process is repeated indefinitely. Denote by B_n the event that the n^{th} drawn ball is blue.
Compute the probability that the second ball is blue, i.e. P(B_2)
Compute the probability that the first drawn ball is blue given that the second drawn ball is blue.
Show that the probability P(B_n) that the n^{th} drawn ball is blue is independent of n.
Compute the probability that the first drawn ball is blue given that the n subsequent drawn balls are blue.
2 Probability and Random Variables on countable state space
2.1 Probabilities on countable state spaces
If \Omega=(\omega_1, \omega_2, \dots) is a countable sample space, take as \sigma-algebra \mathcal{A}=2^{\Omega}, the collection of all subsets of \Omega.
Specifying a probability on \mathcal{A} is equivalent to choosing a collection of numbers p_n=P(\{\omega_n\}), the probability of the event \{\omega_n\}, with
p_n \ge 0 \quad \text{ and } \quad \sum_{n} p_n =1
and then for any A \subset \Omega we have
P(A) = \sum_{\omega \in A} P(\{\omega\})
Countable additivity reduces to the fact that for absolutely convergent series we can freely interchange the order of summation: P( \cup_{i} A_i) = \sum_{\omega \in \cup_{i} A_i} p(\omega) =\sum_{i} \sum_{\omega \in A_i} p(\omega) = \sum_i P(A_i) if the A_i are pairwise disjoint.
We recall some standard probability models.
Definition 2.1 (Poisson) A Poisson distribution with parameter \lambda>0 is a probability distribution on \{0,1,2,3, \cdots\} with
p_n = e^{-\lambda} \frac{\lambda^n}{n!}.
The Poisson distribution is ubiquitous, in particular because of its relation with the binomial distribution (sometimes called the Law of small numbers, see Theorem 2.1). Typical examples are the number of typos on a page, the number of earthquakes hitting a region, the number of radioactive atom decays, etc.
Definition 2.2 (Pareto or Zeta distribution) A Pareto distribution with decay rate \alpha>0 is a probability distribution on \{1,2,3, \cdots \} with
p_n = \frac{1}{\zeta(\alpha+1)} \frac{1}{n^{\alpha+1}} \quad \text{ where } \quad \zeta(s)=\sum_{n=1}^\infty \frac{1}{n^s} \text{ is the Riemann zeta function}
The Pareto distribution is a model with polynomial tails. It is widely used in economics, where polynomial tails often occur. Classical examples are the wealth distribution in a population, the size of cities, or the number of casualties in wars. See the Pareto principle (“20% of the population controls 80% of the wealth”) and the examples in this paper.
Definition 2.3 (Independent Bernoulli trials) The model is characterized by an integer n (= number of trials) and by a number 0\le p \le 1 (= the probability of success in each trial). We assume that the trials succeed or fail independently of each other. The state space is \Omega=\{0,1\}^n and we write \omega=(\omega_1,\cdots, \omega_n) with
\omega_{i} = \left\{
\begin{array}{cl}
0 & \text{ if } i^{th} \text{ trial fails} \\
1 & \text{ if } i^{th} \text{ trial succeeds}
\end{array}
\right.
If we set \|\omega\|_1=\sum_{i=1}^n \omega_i which counts the number of 1’s in \omega then
P(\{\omega\}) = p^{\|\omega\|_1} (1-p)^{n- \|\omega\|_1}
Definition 2.4 (Binomial random variable) The binomial distribution describes the number of successes in n Bernoulli trials. It is a probability distribution on \{0,1,\cdots,n\} with
p_k = {n \choose k} p^k (1-p)^{n-k}
where {n \choose k}=\frac{n!}{k!(n-k)!} is the number of ways k successes can occur among n trials.
Definition 2.5 (Geometric distribution) The geometric distribution describes when the first successful trial occurs in a series of Bernoulli trials. It is a probability distribution on \{1,2,3,\cdots\} with
p_n = (1-p)^{n-1}p
since we have n-1 unsuccessful trials before the first successful one.
Definition 2.6 (Negative binomial or Pascal’s distribution) The negative binomial distribution describes when the k^{th} successful trial occurs in a series of Bernoulli trials. It is a probability distribution on \{k, k+1,k+2 \cdots\} with
p_n = { n-1 \choose k-1} (1-p)^{n-k} p^k
Sometimes the negative binomial is defined slightly differently, counting the number of failures until the k^{th} success occurs.
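The description of the negative binomial as the trial number of the k^{th} success can be sanity-checked with scipy by simulating it as a sum of k geometric variables (parameters are ours; note that scipy's nbinom uses the number-of-failures convention, hence the shift by k):

```python
from scipy.stats import geom, nbinom

p, k = 0.3, 3

# trial number of the k-th success = sum of k independent geometric variables
samples = geom.rvs(p, size=(100_000, k), random_state=12345).sum(axis=1)

# empirical frequency of {X = n} vs the pmf (n-1 choose k-1)(1-p)^{n-k} p^k;
# scipy's nbinom counts failures, so X = n corresponds to n - k failures
n = 7
empirical = (samples == n).mean()
exact = nbinom.pmf(n - k, k, p)
print(empirical, exact)
```

The two numbers agree up to Monte Carlo noise, confirming the sum-of-geometrics description.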
2.2 Poisson approximation
If the number of trials n \gg 1 is large and the probability of success p \ll 1 is small, then the binomial distribution can be approximated by a Poisson distribution with parameter \lambda = np. A formal mathematical result is the following (proof in the homework).
Theorem 2.1 (The Law of small numbers) Suppose the probability of success, p_n, varies with n in such a way that \lim_{n\to \infty} np_n=\lambda (take p_n=\frac{\lambda}{n}) then we have
\lim_{n \to \infty} { n \choose k } p_n^k (1-p_n)^{n-k} = e^{-\lambda} \frac{\lambda^k}{k!}
Example: Birthday problem
There are N students in a class. Event A=“at least two students share the same birthday”.
p = probability that a given pair of students share the same birthday: p=\frac{1}{365}.
The number of trials here is the number of pairs of students, n = {N \choose 2}= \frac{N(N-1)}{2}. Note: trials are only weakly dependent here.
Poisson approximation: \lambda=np={N \choose 2}\frac{1}{365} and so P(A^c) \approx e^{-{N \choose 2} \frac{1}{365}}.
Exact value: P(\text{no pair shares the same birthday}) = \frac{365}{365} \cdot \frac{364}{365} \cdots \frac{ 365 -N+1}{365}
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
from scipy.stats import binom

n = 40     # number of trials
p = 0.1    # probability of success
lam = n * p  # parameter of the Poisson distribution

# values at which to evaluate the PMFs
x = np.arange(0, 15)

# PMFs of the Poisson and binomial distributions
pdfP = poisson.pmf(x, mu=lam)
pdfB = binom.pmf(x, n, p)

# plot the PMFs
plt.figure(figsize=(5, 3))
plt.bar(x, pdfP, color='blue', width=0.5, label=rf"Poisson $\lambda$={lam}")
plt.bar(x, pdfB, color='red', alpha=0.5, label=f"Binomial n={n}, p={p}")
plt.title('Poisson Approximation')
plt.xlabel('k')
plt.ylabel('PMF')
plt.legend()
plt.grid(True)
plt.show()
2.3 Random variable on countable state spaces
You can think of a random variable as extracting a quantity that can vary (hence the name “variable”) by observing outcomes \omega \in \Omega. The probability defined on \Omega implies that this variable is random, hence the name.
Definition 2.7 (Random Variables on discrete state space) Suppose \Omega is countable. A map
X: \Omega \to T
with target T (e.g. a subset of \mathbb{R} or \mathbb{N} or … ) is called a random variable (RV).
Since \Omega is countable so is the image T'=\{X(\omega)\,;\, \omega \in \Omega\} \subset T, so we may assume T is countable as well. The distribution of X, also called the law of X, is a probability measure on T, denoted P^X, induced from the probability P on \Omega and defined as follows:
P^X(A) = P(\{ \omega\,;\, X(\omega) \in A\}) = P(X^{-1}(A)) =P( X \in A).
In particular for j \in T' we have
p^X_j = P(X =j) = \sum_{\omega\,;\, X(\omega)=j} P(\{\omega\})
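The law P^X can be computed mechanically when \Omega is small. In the sketch below (our example) \Omega=\{0,1\}^3 carries the Bernoulli-trials weights and X(\omega)=\|\omega\|_1; grouping the weights p_\omega by the value of X recovers the binomial probabilities:

```python
from itertools import product
from math import comb

p = 0.4
# Omega = {0,1}^3 with the independent-Bernoulli-trials weights
weights = {omega: p ** sum(omega) * (1 - p) ** (3 - sum(omega))
           for omega in product([0, 1], repeat=3)}

# law of X(omega) = number of 1's: p^X_j = sum of weights over {X = j}
law = {}
for omega, w in weights.items():
    j = sum(omega)
    law[j] = law.get(j, 0.0) + w

print(law)  # matches the binomial pmf comb(3, j) p^j (1-p)^(3-j)
```

This is the general recipe: push each weight P(\{\omega\}) forward to the value X(\omega) and sum.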
2.4 Sampling with or without replacement
An urn contains N balls, b blue balls and r=N-b red balls. We select n balls out of the N balls, either with or without replacing the balls after choosing them.
Sample space \Omega=\{0,1\}^n with \omega =(\omega_1, \cdots, \omega_n) where \omega_i=0 means getting a red ball and \omega_i=1 means getting a blue ball on the i^{th} selection.
Random variable X(\omega)= \|\omega\|_1 = \sum_i \omega_i, which is the number of blue balls drawn; it takes values in \{0,1,\cdots,n\}
Definition 2.8 (Sampling distributions: binomial and hypergeometric)
If we sample with replacement, the probability to get a blue ball is always p=\frac{b}{b+r} and we get a binomial distribution
p^X_k = P(X=k) = {n \choose k}p^k (1-p)^{n-k}.
If we sample without replacement, then we get a hypergeometric distribution
p^X_k = P(X=k) = \frac{{b \choose k}{N-b \choose n-k}}{N \choose n}
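Both sampling distributions are available in scipy, which makes the comparison easy: when N is large compared to n, sampling without replacement differs little from sampling with replacement (the parameters below are ours; scipy's hypergeom takes the arguments in the order total N, successes b, draws n):

```python
from scipy.stats import binom, hypergeom

b, r, n = 30, 70, 10  # blue balls, red balls, number of draws
N = b + r
p = b / N

for k in range(n + 1):
    with_repl = binom.pmf(k, n, p)            # sampling with replacement
    without_repl = hypergeom.pmf(k, N, b, n)  # sampling without replacement
    print(k, round(with_repl, 4), round(without_repl, 4))
```

Rerunning with, say, N = 10 000 makes the two columns nearly indistinguishable, which is why the binomial is often used as an approximation to sampling without replacement from a large population.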
2.5 Expectation
Definition 2.9 If \Omega=\{\omega_1, \omega_2, \cdots\} is countable and P is a probability measure on \Omega, we define the expectation of a random variable X: \Omega \to \mathbb{R} by
E[X] = \sum_{i} X(\omega_i) p_i \,.
The expectation is well defined if X \ge 0 (in which case the sum could possibly be infinite) or if the series is absolutely convergent: \sum_{i}|X(\omega_i)| p_i < \infty.
We denote by L^1 = \{ X: \Omega \to \mathbb{R}\,;\, \sum_{i}|X(\omega_i)| p_i < \infty\} the set of random variables with a finite expectation E[X].
L^1 is a vector space and we have E[aX+Y] = a E[X] + E[Y] (linearity of the expectation).
Examples:
Poisson RV: X: \Omega \to \mathbb{N} with P(X =j)=e^{-\lambda}\frac{\lambda^j}{j!}. Then
E[X]=\sum_{j=0}^\infty j e^{-\lambda}\frac{\lambda^j}{j!} = \sum_{j=1}^\infty e^{-\lambda}\frac{\lambda^j}{(j-1)!}=\lambda \sum_{j=0}^\infty e^{-\lambda}\frac{\lambda^j}{j!}
=\lambda
Bernoulli (Indicator) Random variable: Given an event A define X_A: \Omega \to \{0,1\} by
X_A(\omega) = \left\{ \begin{array}{cl} 1 & \omega \in A \\ 0 & \omega \notin A\\ \end{array}\right.
Then
E[X_A] = 1 \cdot P(A) + 0 \cdot (1-P(A)) =P(A)
Binomial Random variable: The binomial RV can be written as a sum of n Bernoulli RVs. From this and linearity we see immediately that E[X]=np.
Zeta RV: If X has a zeta distribution with parameter \alpha>0 then
E[X] = \sum_{n=1}^\infty n \frac{1}{\zeta(\alpha+1)} \frac{1}{n^{\alpha+1}} = \frac{1}{\zeta(\alpha+1)} \sum_{n=1}^\infty \frac{1}{n^{\alpha}} =
\left\{ \begin{array}{cl} \frac{\zeta(\alpha)}{\zeta(\alpha+1)} & \alpha >1 \\ +\infty & 0 < \alpha \le 1 \end{array} \right.
Geometric RV: With q=1-p we have
E[X]=\sum_{n=1}^\infty nq^{n-1} p = p \frac{d}{dq} \sum_{n=0}^\infty q^n = p \frac{d}{dq} \frac{1}{1-q} = p \frac{1}{(1-q)^2} = \frac{1}{p}
Negative binomial RV: Since a negative binomial with parameter k is the sum of k geometric RVs, we have E[X]=\frac{k}{p}.
Uniform distribution on \Omega=\{1,2,\cdots, N\}: we have p_j=\frac{1}{N} for every j and thus
E[X] = \frac{1}{N}(1 + 2 + \cdots +N) = \frac{1}{N} \frac{N(N+1)}{2}=\frac{N+1}{2}
A RV without expectation: Suppose p_n = c \frac{1}{n^2} for n \in \mathbb{Z}\setminus \{0\} and p_0=0, with a suitable normalization constant c.
Then the series for the expectation
\sum_{n=-\infty}^\infty n p_n
is undefined because \sum_{n> 0}np_n =\infty and \sum_{n< 0}np_n = -\infty.
Use a Poisson approximation to compute the probability that at least 3 people in a room of N share the same birthday.
Revisit the lottery problem of Exercise 1.5. Use a Poisson approximation to estimate the probability that there are exactly k winners.
(Optional) Show that for the lottery problem the probability that there are exactly k winners is \frac{1}{k!}\sum_{j=0}^{N-k} \frac{(-1)^j}{j!}. Hint: If there are exactly k winners then N-k players are not winners (the probability of that event is known).
Exercise 2.3
Prove that the expected value of a hypergeometric random variable is E[X]=n\frac{b}{r+b}. Hint: Write X=X_1+X_2+ \cdots + X_n where X_i is the number of blue balls in the i^{th} draw. Compute first E[X_1] and then argue by symmetry.
In the powerball, at each drawing 5 balls are selected at random among 69 white balls numbered 1 to 69, and 1 ball is selected among 26 red balls numbered 1 to 26. A powerball ticket costs $2 and consists of selecting 5+1 numbers. You get a prize if the balls selected match the winning balls; see the prizes here.
Express the probability of each prize using the hypergeometric distribution.
All the prizes are fixed except the jackpot, obtained for 5 correct white balls plus the red powerball. You will observe that most people play only when the jackpot is big enough. As a rational mathematician, determine the minimal jackpot for which it makes sense to buy a powerball ticket.
Exercise 2.4 Suppose X is a geometric random variable with success parameter p.
Show that P(X> m+n|X>n)=P(X>m). What does that mean?
Show that E\left[ \frac{1}{X}\right] =\log \left( p^{\frac{p}{p-1}} \right)
Show that E\left[ X(X-1)(X-2) \cdots (X-r+1)\right] = \frac{r! (1-p)^{r-1}}{p^r}. Use this to compute E[X^2] and E[X^3].
Exercise 2.5 Suppose X is binomial with parameters (n,p).
Find the value j for which P(X=j) is maximal. Hint: Compute the ratio P(X=j+1)/P(X=j).
Compute the probability that X is even. Hint: Compute (p-(1-p))^n.
3 Construction of probability measures
3.1 p-systems and d-systems
Definition 3.1 (p-systems and d-systems)
\mathcal{C} is a p-system if it is closed under (finite) intersections.
\mathcal{D} is a d-system if
\Omega\in \mathcal{D}
A,B \in \mathcal{D} and A \supset B \implies A \setminus B \in \mathcal{D}
A_1, A_2, \cdots \in \mathcal{D} \textrm{ with } A_n \nearrow A \implies A \in \mathcal{D}
The p stands for product (= intersection) and the d stands for Eugene Dynkin, who introduced the concept.
It is obvious that a \sigma-algebra is both a p-system and a d-system. The next proposition shows the converse.
Proposition 3.1 \mathcal{E} is a \sigma-algebra if and only if \mathcal{E} is a p-system and a d-system.
Proof. If \mathcal{E} is a p-system and a d-system then \Omega and \emptyset are in \mathcal{E} and \mathcal{E} is closed under complement. All this follows from properties 1. and 2. for d-systems. Furthermore \mathcal{E} is then closed under union since A \cup B= (A^c \cap B^c)^c. Finally, to extend this to countable unions, for A_1, A_2, \cdots \in \mathcal{E} define B_n=\cup_{i=1}^n A_i; then B_n \nearrow \cup_{i=1}^\infty A_i and property 3. of d-systems applies. \quad \square
3.2 Monotone Class Theorem
The next theorem is a version of many theorems of the same type in probability and measure theory.
Theorem 3.1 (Monotone Class Theorem) If a d-system contains a p-system \mathcal{C} then it contains the \sigma-algebra generated by \mathcal{C}.
Proof. Consider the smallest d-system \mathcal{D} containing \mathcal{C} (intersections of d-systems are d-systems). We claim that \mathcal{D} \supset \sigma(\mathcal{C}). Since \sigma(\mathcal{C}) is the smallest \sigma-algebra containing \mathcal{C}, it is enough to show that \mathcal{D} is a \sigma-algebra itself. By Proposition 3.1 we only need to show that \mathcal{D} is a p-system.
Fix B \in \mathcal{C} and consider \mathcal{D_1}=\{ A \in \mathcal{D}\,:\, A \cap B \in \mathcal{D} \}.
Note that B belongs to \mathcal{D}. We claim that \mathcal{D_1} is a d-system. Clearly \Omega \in \mathcal{D_1}. Further, if A_1 \subset A_2 with both A_1,A_2 \in \mathcal{D_1}, then (A_2 \setminus A_1) \cap B = (A_2 \cap B) \setminus (A_1 \cap B), which belongs to \mathcal{D}. Similarly, if A_n \in \mathcal{D_1} and A_n \nearrow A then (A_n \cap B) \nearrow (A \cap B), so A \cap B \in \mathcal{D} and thus A \in \mathcal{D_1}. \mathcal{D_1} is thus a d-system and it contains \mathcal{C}, since B \in \mathcal{C} and \mathcal{C} is a p-system. Therefore \mathcal{D_1} \supset \mathcal{D}, and we have shown that if A \in \mathcal{D} and B \in \mathcal{C} then A \cap B \in \mathcal{D}.
We now define for fixed A \in \mathcal{D} the set \mathcal{D_2}=\{ B \in \mathcal{D}\,:\, A \cap B \in \mathcal{D} \}.
One verifies that \mathcal{D_2} is a d-system (just as for \mathcal{D_1}); by the previous step \mathcal{C} \subset \mathcal{D_2}, and thus \mathcal{D_2} \supset \mathcal{D}. This proves that \mathcal{D} is a p-system. \quad \square
3.3 Uniqueness of Measures
It is impossible to compute P(A) for all sets. An important application of the monotone class theorem is that knowing the values of P on a p-system generating \mathcal{A} determines P uniquely.
Theorem 3.2 (Uniqueness of probability measures) Suppose P and Q are two probability measures on (\Omega,\mathcal{A}). If P(A)=Q(A) for all A in a p-system \mathcal{C} generating \mathcal{A} then P=Q.
Proof. We know that P(A)=Q(A) for all A \in \mathcal{C}. Let us consider \mathcal{D} = \left\{ B \in \mathcal{A}\,:\, P(B)=Q(B)\right\}.
Clearly \mathcal{D} \supset \mathcal{C} so to use the Monotone Class Theorem we need to show that \mathcal{D} is a d-system.
Since P(\Omega)=Q(\Omega)=1 then \Omega \in \mathcal{D} and so property 1. holds.
For property 2., suppose A,B \in \mathcal{D} with A \supset B; then A \setminus B \in \mathcal{D} since
P( A \setminus B) = P(A) - P(B) = Q(A)-Q(B) = Q(A \setminus B)
For property 3., if \{A_n\} \subset \mathcal{D} and A_n \nearrow A, then P(A_n)=Q(A_n) for all n and by sequential continuity (see Theorem 1.3) they must have the same limits; thus P(A)=Q(A) and so A \in \mathcal{D}. \quad \square
Corollary 3.1 If two probability measures P and Q coincide on the sets of the form (-\infty,a] then they are equal.
4 Measurable maps and random variables
4.1 Motivation
Given a probability space (\Omega, \mathcal{A}, P) we think of A \in \mathcal{A} as an event and of P(A) as the probability that the event A occurs. Think of this as an “observation”: how likely is it that A occurs?
A random variable is a more general kind of observation. Think, for example, that you are performing some measurement: to an outcome \omega \in \Omega you associate, e.g., a number X(\omega) \in \mathbb{R}. It could also be a vector or even some more general object (e.g. a probability measure!)
Consider another state space (F,\mathcal{F}) (often we will take (\mathbb{R}, \mathcal{B}) where \mathcal{B} is the Borel \sigma-algebra) and a map
X: \Omega \to F
We will want to compute
P(\{\omega, X(\omega) \in A\})= P(X \in A) = P(X^{-1}(A))
for some A \in \mathcal{F}.
The notation X^{-1}(A) =\{ \omega\,:\, X(\omega)\in A\} is for the inverse image, and for this to make sense we will need X^{-1}(A) \in \mathcal{A}.
All of this motivates the following definitions.
4.2 Measurable functions and random variables
Given a function f:E \to F and B \subset F we write
f^{-1}(B)= \left\{ x \in E \,;\, f(x) \in B \right\}
for the inverse image. The following properties are easy to verify:
f^{-1}(\emptyset)=\emptyset, \quad f^{-1}(F)=E
f^{-1}(A\setminus B) = f^{-1}(A) \setminus f^{-1}(B)
f^{-1}\left(\bigcup_{n} A_n\right) = \bigcup_{n} f^{-1}(A_n), \quad f^{-1}\left(\bigcap_{n} A_n\right) = \bigcap_{n} f^{-1}(A_n)
Definition 4.1 (Measurable and Borel functions) Given measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}), a function f: E \to F is measurable (with respect to \mathcal{E} and \mathcal{F}) if
f^{-1}(B) \in \mathcal{E} \textrm{ for all } B \in \mathcal{F} \,.
If F=\mathbb{R} (equipped with the Borel \sigma-algebra \mathcal{B}) a measurable function is often called a Borel function.
Definition 4.2 (Random variable) A random variable is a measurable function
X: \Omega \to F
from a probability space (\Omega,\mathcal{A},P) to some measurable space (F,\mathcal{F}). Convention: If F=\mathbb{R} then we always take the Borel \sigma-algebra.
Remarks:
Using the letter X for a random variable is standard convention from elementary probability.
The term “random variable” is maybe a bit unfortunate but it is standard. The word “variable” means we have a function and the word “random” means it is defined on some probability space.
Compare this to the definition of continuity: a function is continuous if, for every open set O, f^{-1}(O) is open.
We just say measurable if there is no ambiguity on the choice of \mathcal{E} and \mathcal{F}.
Fortunately it is enough to check the condition for a few sets
Proposition 4.1 f: E\to F is measurable with respect to \mathcal{E} and \mathcal{F} if and only if
f^{-1}(B) \in \mathcal{E} \quad \textrm{ for all } \quad B \in \mathcal{C}
where \mathcal{C} generates \mathcal{F} (i.e. \sigma(\mathcal{C})=\mathcal{F}).
Proof. Consider the family of sets
\mathcal{D} = \left\{ B \in \mathcal{F}\,:\, f^{-1}(B) \in \mathcal{E} \right\}
We know that \mathcal{D} \supset \mathcal{C} and that \sigma(\mathcal{C}) = \mathcal{F}.
To conclude it is enough to show that \mathcal{D} is a \sigma-algebra, because if this is true then \mathcal{D} \supset \mathcal{C} implies \mathcal{D} \supset \sigma(\mathcal{C})=\mathcal{F}.
Showing that \mathcal{D} is a \sigma-algebra is easy using the rules for inverse images in Section 4.2.
Corollary 4.1 A function f from (E,\mathcal{E}) to (\mathbb{R},\mathcal{B}) is measurable if and only if
f^{-1}((-\infty, a]) = \{x\in E\,:\, f(x) \le a\} \in \mathcal{E} \quad \text{ for all } a \in \mathbb{R},
that is, the level sets of the function f need to be measurable sets.
4.3 Operations on measurable functions
Composition of functions
f: E \to F, \quad \quad g: F \to G, \quad \quad g\circ f: E \to G
Just as continuity is preserved by composition, so is measurability.
Theorem 4.1 (Composition preserves measurability) If f:E \to F is measurable (w.r.t. \mathcal{E} and \mathcal{F}) and g:F \to G is measurable (w.r.t. \mathcal{F} and \mathcal{G}) then the composition h = g \circ f is measurable (w.r.t \mathcal{E} and \mathcal{G}).
Proof. If C \in \mathcal{G} then (g \circ f)^{-1}(C) = f^{-1}( g^{-1}(C)). By the measurability of g, g^{-1}(C) \in \mathcal{F} and so by the measurability of f, f^{-1}( g^{-1}(C)) \in \mathcal{E}.
Given a function f:E \to \mathbb{R} we define positive/negative parts
f_+ = f \vee 0, \quad f_- = -( f \wedge 0) \quad \implies \quad f=f_+-f_- ,\quad |f|=f_++f_-
Theorem 4.2 f: E \to \mathbb{R} is measurable if and only if f_+ and f_- are measurable.
Proof. It is enough to consider sets of the form \{x, f(x)\le a\}. Proof in your homework.
4.4 Simple functions
Definition 4.3 (Simple functions)
Given a set A \in \mathcal{E}, the indicator function 1_A is defined as
1_A(x) = \left\{ \begin{array}{cl} 1 & \textrm{ if } x \in A \\
0 & \textrm{ otherwise} \end{array} \right.
A simple function f is a function of the form
f(x) = \sum_{i=1}^n a_i 1_{A_i}(x)
for some finite n, real numbers a_i, and measurable sets A_i.
Remarks
The A_i are not necessarily disjoint.
A measurable function is simple if and only if it takes finitely many different values.
The decomposition is not unique.
Definition 4.4 A simple function is in canonical form if
f(x)= \sum_{i=1}^m b_i 1_{B_i}(x)
where b_i are all distinct and (B_i)_{i=1}^m form a partition of E.
Remark: One can always rewrite a simple function in canonical form if needed. Just make a list of the values the function takes b_1, b_2, \cdots, b_m and set B_i=\{x, f(x)=b_i\}.
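The recipe in the remark is easy to carry out numerically. Here is a minimal Python sketch, using a made-up simple function on a finite space E given by overlapping (non-disjoint) indicator pieces:

```python
import numpy as np

# A made-up simple function on E = {0,...,9}: f = 2*1_{A1} + 3*1_{A2},
# where A1 and A2 are NOT disjoint, so this is not the canonical form.
E = np.arange(10)
A1 = E < 6                 # A1 = {0,...,5}
A2 = E >= 4                # A2 = {4,...,9}
f = 2.0 * A1 + 3.0 * A2    # takes the three values 2, 5 and 3

# Canonical form: list the distinct values b_i and take the level sets
# B_i = {x : f(x) = b_i}; the B_i automatically form a partition of E.
b = np.unique(f)
B = [E[f == bi] for bi in b]
for bi, Bi in zip(b, B):
    print(f"value {bi}: set {Bi.tolist()}")
```

The sets printed are pairwise disjoint and their union is all of E, as required by Definition 4.4.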
Proposition 4.2 If f and g are simple function then so are
f+g,\quad f-g, \quad fg, \quad f/g, \quad f\vee g, \quad f\wedge g
Proof. The simplest way to see this is to note that each of these functions takes at most finitely many values if f and g do, and therefore they must be simple functions. \quad \square
4.5 Supremum, infimum, limits
As we see next measurability is preserved by basic operations, in particular taking limits.
Refresher on \limsup and \liminf of sequences: Recall the definitions of \liminf and \limsup for sequences of real numbers (they always exist if we allow the values \pm \infty.)
\liminf_n a_n = \sup_{n} \inf_{m \ge n} a_m = \lim_{n} \inf_{m \ge n} a_m = \textrm{ smallest accumulation point of } \{a_n\}
\limsup_n a_n = \inf_{n} \sup_{m \ge n} a_m = \lim_{n} \sup_{m \ge n} a_m = \textrm{ largest accumulation point of } \{a_n\}
Theorem 4.3 Suppose f_n: E \to \overline{\mathbb{R}}, n=1,2,\cdots is a sequence of measurable functions (with respect to \mathcal{E} and the Borel \sigma-algebra). Then the functions
\inf_n f_n, \quad \quad \sup_n f_n\,,\quad \quad \liminf_n f_n\,, \quad \quad \limsup_n f_n,
are measurable.
If f=\lim_{n}f_n exists then f is measurable.
Proof.
Let us write g=\sup_{n} f_n. It is enough to check that \{g \le a\} is measurable for any a. We have
\{ g \le a \} =\{ f_n \le a \text{ for all } n\} = \bigcap_{n} \{ f_n \le a\}\,.
So \sup_n f_n is measurable if each f_n is measurable.
For g=\inf_{n} f_n we could use that the Borel \sigma-algebra is generated by the collection \{ [a,+\infty) \,:\, a \in \mathbb{R}\} and
\{ g \ge a \} =\{ f_n \ge a \text{ for all } n\} = \bigcap_{n} \{ f_n \ge a\}\,.
Since \limsup and \liminf are written in terms of \inf and \sup they do preserve measurability.
If f=\lim_n{f_n} exists then f=\lim_n{f_n} = \limsup_n f_n = \liminf_n f_n and thus is measurable.
4.6 Approximation by simple functions
The following theorem is very important because it reduces many a computation about measurable functions to a computation about simple functions, followed by taking a limit. In that context one also uses all the time that any measurable f is the difference of two nonnegative measurable functions.
Theorem 4.4 (Approximation by simple functions) A nonnegative function f:E\to \mathbb{R}_+ is measurable if and only if f is the limit of an increasing sequence of nonnegative simple functions.
d_n(x) = \sum_{k=1}^{n2^n} \frac{k-1}{2^n} 1_{[\frac{k-1}{2^n},\frac{k}{2^n})}(x) + n\, 1_{[n,\infty)}(x)
The function d_n is simple and right continuous, and d_n(x) \nearrow x on [0,\infty).
Proof. It is not difficult to see that the functions d_n given above are increasing in n (due to the dyadic decomposition) and d_n(x) \nearrow x as n \to \infty since if x \in [\frac{k-1}{2^n},\frac{k}{2^n}) then |x-d_n(x)| \le \frac{1}{2^n}.
Let f be a non-negative measurable function then the function
g_n = d_n \circ f
is a measurable function (as a composition of measurable functions) and it is a simple function because d_n \circ f takes only finitely many values. Since d_n(y) \nearrow y for y \ge 0 and f(x)\ge 0, we get d_n(f(x)) \nearrow f(x). \quad \square
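Numerically, d_n admits the closed form \min(\lfloor 2^n x\rfloor/2^n,\, n), which makes the two claims in the proof easy to check. A small Python sketch (the helper name d is our own):

```python
import numpy as np

def d(n, x):
    """Dyadic approximation d_n from Theorem 4.4:
    d_n(x) = (k-1)/2^n on [(k-1)/2^n, k/2^n) for k = 1, ..., n*2^n,
    and d_n(x) = n for x >= n; closed form min(floor(2^n x)/2^n, n)."""
    x = np.asarray(x, dtype=float)
    return np.minimum(np.floor(x * 2**n) / 2**n, n)

x = np.linspace(0, 5, 1001)
for n in range(1, 8):
    # d_n increases in n ...
    assert np.all(d(n, x) <= d(n + 1, x))
    # ... and approximates x to within 2^-n on [0, n)
    mask = x < n
    assert np.all(x[mask] - d(n, x[mask]) <= 2.0**-n + 1e-12)

print([float(d(n, 2.7)) for n in range(1, 6)])
# → [1.0, 2.0, 2.625, 2.6875, 2.6875]
```

Note how the approximation at x=2.7 is first capped at n (for n=1, 2) and then refines dyadically.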
Corollary 4.2 (Approximation by simple functions) A function f:E\to \mathbb{R} is measurable if and only if it can be written as the limit of sequence of simple functions.
Proof. Write f=f_+-f_- and apply Theorem 4.4 to f_\pm. \quad \square
Theorem 4.5 Suppose f and g are measurable then
f+g, \quad f-g, \quad fg, \quad f/g ( \text{ if } g(x) \not=0 )
are measurable.
4.7 Extended real-valued functions
Often it is useful to consider functions which are allowed to take values \pm \infty.
The Borel \sigma-algebra on \overline{\mathbb{R}} consists of all sets of the form A, A \cup \{-\infty\}, A \cup \{\infty\}, A \cup \{-\infty, \infty\}.
This Borel \sigma-algebra is generated by the intervals of the form \{[-\infty, r]\}.
All properties of measurable functions f: E \to \mathbb{R} extend to functions f: E \to \overline{\mathbb{R}}: approximation by simple functions, supremum, infimum, etc.
We will use all this whenever we need it.
4.8 Homework problems
Exercise 4.1 Show that f is measurable if and only if f_+ and f_- are measurable.
Exercise 4.2 Suppose f: \mathbb{R} \to \mathbb{R} (both equipped with Borel \sigma algebra) is continuous. Show that f is measurable. Remark: This also holds for any continuous function between arbitrary topological spaces.
Exercise 4.3
Suppose f: \mathbb{R} \to \mathbb{R} (both equipped with the Borel \sigma-algebra) is a right-continuous step function, i.e., there exists a (finite or countable) collection of intervals I_n=[t_n, s_n) such that f is constant on each I_n and \cup_n I_n = \mathbb{R}. Show that such a function is measurable.
A function f: \mathbb{R} \to \mathbb{R} is right continuous if f(x_n)\to f(x) for any decreasing sequence x_n \searrow x and this holds for every x. Show that such a function is measurable. Hint: Set c_n = \sum_{k=1}^{\infty} \frac{k}{2^n} 1_{ [\frac{k-1}{2^n},\frac{k}{2^n})} and f_n =f \circ c_n.
Exercise 4.4 Suppose f: \mathbb{R} \to \mathbb{R} is increasing. Show that f is measurable.
Exercise 4.5 Given two measurable function f,g from (E,\mathcal{E}) to (\mathbb{R}, \mathcal{B}). Show that the sets
\{f \le g \}, \quad \{f < g \}, \quad \{f = g \},\quad \{f\not =g\}
are all measurable.
Exercise 4.6 Suppose (E, \mathcal{E}) and (F,\mathcal{F}) are two measurable spaces. A (measurable) rectangle in E\times F is a set of the form
A \times B \quad A \in \mathcal{E}, B \in \mathcal{F}.
The product \sigma-algebra \mathcal{E} \otimes \mathcal{F} is defined as the \sigma-algebra generated by all measurable rectangles.
Suppose f: E \to F is measurable (with respect to \mathcal{E} and \mathcal{F}) and g: E \to G is measurable (with respect to \mathcal{E} and \mathcal{G}). Show that the function h: E \to F\times G given by h(x)=(f(x),g(x)) is measurable (with respect to \mathcal{E} and \mathcal{F}\otimes \mathcal{G}).
Suppose f: E \times F \to G is measurable (with respect to \mathcal{E}\otimes \mathcal{F} and \mathcal{G}). For any fixed x_0\in E define the section of f as the function
h: F \to G \quad \text{ with } h(y)=f(x_0,y)
Show that h is measurable. Hint: Show first that the map g:F \to E\times F given by g(y)=(x_0,y) is measurable.
5 Distribution functions and quantile functions
5.1 Random variables
Let us apply what we have learned in the last sections to random variables
X : \Omega \to \mathbb{R}
where (\Omega, \mathcal{A}, P) is a probability space.
Theorem 5.1 Suppose f: E \to F is a measurable map between the measurable spaces (E, \mathcal{E}) and (F,\mathcal{F}) and P a probability measure on (E, \mathcal{E}). Then
f^{-1}(\mathcal{F})= \left\{ f^{-1}(B)\,:\, B \in \mathcal{F} \right\} is a \sigma-algebra, in general a sub-\sigma-algebra of \mathcal{E}.
P\circ f^{-1}, defined as P\circ f^{-1}(B) = P( f^{-1}(B)) = P(\{x\,:\, f(x) \in B \}), is a probability measure on (F,\mathcal{F}).
Proof. Check the axioms.
Definition 5.1 (Image of a measure) The measure P \circ f^{-1} is called the image of the measure P under f. Various other notations are used (such as f_\#P, etc…)
Adding some terminology
Definition 5.2 (The \sigma-algebra generated by a random variable X) Given a random variable X : \Omega \to \mathbb{R} defined on the probability space (\Omega, \mathcal{A}, P), the \sigma-algebra generated by a random variable X is the \sigma-algebra X^{-1}(\mathcal{B}) \subset \mathcal{A}.
The interpretation is that this \sigma-algebra contains all the “information” you can extract from the probability measures P simply by using the random variable X. This will play an increasingly important role in the future!
Definition 5.3 (Distribution of a random variable X) Given a random variable X : \Omega \to \mathbb{R} defined on the probability space (\Omega, \mathcal{A}, P), the distribution of the random variable X is the probability measure P^X given by
P^X \equiv P\circ X^{-1}
defined on (\mathbb{R}, \mathcal{B}). That is we have
P^X(B) = P(X \in B).
5.2 Cumulative distribution function
By Corollary 3.1, probability measures on \mathbb{R} are uniquely determined by their values on the intervals (-\infty,x]; this justifies the following definition.
Definition 5.4 (Cumulative distribution function) The cumulative distribution function (CDF) of a random variable X is the function F_X: \mathbb{R} \to [0,1] defined by
F_X(t) = P\{ X \le t\} = P^X((-\infty,t])
Theorem 5.2 (Properties of CDF) If the function F(t) is the CDF for some random variable X, then F has the following properties
F is increasing.
\lim_{t \to -\infty} F(t)=0 and \lim_{t \to +\infty} F(t)=1
F is right-continuous: for every t, F(t)= F(t+)\equiv \lim_{s \searrow t} F(s).
Proof. Item 1. is the monotonicity property for the probability measure P^X. Item 2. follows from sequential continuity and from the fact that (-\infty,t] \searrow \emptyset as t \searrow -\infty and so F(t) \searrow P^X(\emptyset) =0. A similar argument works for t \nearrow \infty. Item 3. follows also from sequential continuity since as s \searrow t, (-\infty,s] \searrow (-\infty,t].
Remarks:
Note that F is in general not (left-)continuous. Indeed if s \nearrow t then (-\infty, s] \nearrow (-\infty,t) and P^X( (-\infty,t] )=P^X( (-\infty,t) ) + P^X( \{t\}). We denote the left limit by F(t-).
One can compute probabilities using the CDF. For example
P( a < X \le b ) = F(b)-F(a)
P( a \le X \le b ) = F(b) - F(a-)
P( X=b ) = F(b)-F(b-)
An atom for a probability measure P on a set \Omega is an element \omega \in \Omega such that P(\{\omega\})>0.
The distribution P^X of the random variable X has an atom at t whenever the CDF is discontinuous at t (i.e. F_X(t-)\not=F_X(t)).
The distribution P^X of the random variable X has at most countably many atoms. (Why? see homework)
A discrete random variable X taking values \cdots < \omega_{-1} < \omega_0 < \omega_1 < \cdots has a purely atomic distribution P^X.
The CDF is piecewise constant and we have
F_X(t)= \sum_{n \,:\, \omega_n \le t } P(\{\omega_n\})
Another way to define a CDF is to use a PDF (=probability density function).
Definition 5.5 (Probability density function) A probability density function (PDF) is a function f: \mathbb{R} \to \mathbb{R} such that
f(t) \ge 0, f is non-negative
\int_{-\infty}^\infty f(t) dt = 1, f is normalized
The corresponding CDF is then given by the integral
F(t)= \int_{-\infty}^t f(x) dx
For now think of the integral as a Riemann integral (e.g. f is piecewise continuous). We will revisit this later when equipped with better integration tools. Many of the classical distributions in probability are given by densities. Here are some examples which will come back.
Examples of PDF:
Uniform RV on [a,b]: This random variable takes values uniformly distributed in the interval [a,b].
f(x)=
\left\{
\begin{array}{cl}
\frac{1}{b-a} & a \le x \le b \\
0 & \text{otherwise}
\end{array}
\right.
\quad \quad
F(x)=
\left\{
\begin{array}{cl}
0 & x \le a \\
\frac{x-a}{b-a} & a \le x \le b \\
1 & x \ge b
\end{array}
\right.
Exponential RV with parameter \beta:
f(x)=
\left\{
\begin{array}{cl}
0 & x \le 0 \\
\beta e^{-\beta x} & x \ge 0
\end{array}
\right.
\quad \quad
F(x)=
\left\{
\begin{array}{cl}
0 & x \le 0 \\
1 - e^{-\beta x} & x \ge 0 \\
\end{array}
\right.
Gamma RV with parameters (\alpha,\beta):
Weibull distribution with parameters (\alpha,\beta):
Normal distribution with parameters (\mu, \sigma^2): The normal distribution has parameters \mu \in \mathbb{R} and \sigma^2 > 0.
f(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \quad \quad F(x)=\int_{-\infty}^x f(t) \, dt
Log-normal distribution parameters (\mu, \sigma^2):
Laplace distribution with parameters (\alpha,\beta): This is a 2-sided and shifted version of the exponential distribution.
f(x)= \frac{\beta}{2} e^{-\beta|x-\alpha|}
Cauchy distribution with parameters (\alpha,\beta): This is an example of a distribution without a finite mean.
f(x)= \frac{1}{\beta \pi} \frac{1}{1+(x-\alpha)^2/\beta^2} \quad \quad F(x) =\frac{1}{\pi} \arctan\left( \frac{x-\alpha}{\beta}\right) + \frac{1}{2}
Pareto distribution with parameters (x_0,\alpha):
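As a sanity check, the CDFs above can be recovered from the PDFs by numerical integration. Here is a small Python sketch for the exponential case (the values of \beta and t are arbitrary choices for this check):

```python
import numpy as np

# Approximate F(t) = integral of f over (-inf, t] by a left Riemann sum,
# using the exponential PDF f(x) = beta*exp(-beta*x) on [0, infinity).
beta, t, dx = 2.0, 1.5, 1e-5
x = np.arange(0.0, t, dx)                # f vanishes for x < 0
riemann = np.sum(beta * np.exp(-beta * x)) * dx
exact = 1.0 - np.exp(-beta * t)          # closed-form CDF of the exponential
print(riemann, exact)                     # agree to about 1e-5
```

The left Riemann sum of the decreasing density slightly overestimates the integral, with error at most (f(0)-f(t))·dx here.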
5.3 Random variables with mixed distribution.
It is easy to build a random variable whose distribution is neither discrete nor continuous.
Example: Flip a fair coin. If the coin lands on tails you win a prize X uniformly distributed on [0,1], and if the coin lands on heads you lose (X=0). Then X has an atom at 0 and
F(x)=
\left\{
\begin{array}{cl}
0 & x < 0 \\
\frac{1}{2} + \frac{1}{2} x & 0 \le x < 1 \\
1 & x \ge 1
\end{array}
\right.
Definition 5.6 (Mixtures of Random variables) Suppose X_1, X_2, \cdots, X_m are random variables with CDFs F_{X_i}(t) and \alpha=(\alpha_1, \cdots, \alpha_m) is such that \alpha_i \ge 0 and \sum_{i=1}^m \alpha_i =1. Then
\sum_{i=1}^m \alpha_i F_{X_i}(t)
is a CDF of a random variable X which is called the (\alpha_1, \cdots, \alpha_m) mixture of X_1, X_2, \cdots, X_m.
In the previous example we had a (1/2,1/2) mixture of X_1=0 (a discrete RV) and X_2 a uniform RV on [0,1] (a continuous RV).
5.4 Devil’s staircase
We construct here a CDF with remarkable properties
F(t) has no discontinuities (no atoms)
F(t) does not have a density, that is F(t) cannot be written as F(t)=\int_0^t f(s)\, ds.
The construction is based on the Cantor set and F is defined iteratively.
Set F_0(t)=t
Define the function F_1 to be equal to \frac{1}{2} on [1/3, 2/3], continuous and piecewise linear on [0,1] with F_1(0)=0 and F_1(1)=1. Then we have |F_1(t)-F_0(t)|< \frac{1}{2}.
In the second step, let F_2 be equal to \frac{1}{4} on [1/9, 2/9], unchanged on [1/3, 2/3], equal to \frac{3}{4} on [7/9, 8/9], continuous and piecewise linear on [0,1] with F_2(0)=0 and F_2(1)=1. We have |F_2(t)-F_1(t)|< \frac{1}{4}.
Repeat the procedure, now on the intervals [1/27, 2/27], [7/27, 8/27], [19/27, 20/27], [25/27, 26/27], and so on.
It is not difficult to see, by induction, that |F_{n}(t)-F_{n-1}(t)| \le \frac{1}{2^{n}} and thus the sequence F_n converges uniformly to a continuous function F(t) which is increasing on [0,1].
The function F(t) is a CDF in good standing. We have P([1/3,2/3])=0 as well as P([1/9,2/9])=P([7/9,8/9])=0 and so on. In particular there are 2^{n-1} intervals of length 3^{-n} whose probability vanishes. The total length of all the intervals on which the probability vanishes is \frac{1}{3} + 2 \times \frac{1}{9} + 4 \times \frac{1}{27} + \cdots =\sum_{n=1}^\infty \frac{2^{n-1}}{3^n} = 1.
A random variable X with CDF F(t) is neither continuous (in the sense of having a density), nor discrete.
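The iterative construction above can be carried out in code. The sketch below uses the equivalent self-similar recursion for the Cantor function (F(t)=F(3t)/2 on [0,1/3], F=1/2 on [1/3,2/3], F(t)=1/2+F(3t-2)/2 on [2/3,1]); the function name cantor is our own:

```python
def cantor(t, n=60):
    """Approximate the devil's staircase F(t) on [0,1] by n refinement steps;
    the error is at most 2**-n, matching |F_n - F_{n-1}| <= 2**-n above."""
    F, scale = 0.0, 0.5
    for _ in range(n):
        if t < 1/3:
            t = 3 * t                # recurse into the left third
        elif t <= 2/3:
            return F + scale         # middle third: F is flat, value locked in
        else:
            F += scale               # right third: the level 1/2 is passed
            t = 3 * t - 2
        scale /= 2
    return F + scale * t             # remaining linear approximation

# Known values on the flat pieces:
assert cantor(0.5) == 0.5
assert cantor(1/9) == 0.25 and cantor(8/9) == 0.75
print(cantor(0.25))   # F(1/4) = 1/3, a value off the flat pieces
```

Points landing in a removed middle-third interval get their value in finitely many steps; all other points are approximated geometrically fast.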
5.5 Quantile functions
Intuitively a p-quantile for a RV X, where p \in (0,1), is a value t \in \mathbb{R} where F_X(t)=P(X\le t) reaches (or crosses) the level p. For p=\frac{1}{2} it is usually referred to as the median. More formally:
Definition 5.7 (Quantiles of a RV X.) For p \in (0,1), a p-quantile for the RV X is a value t\in \mathbb{R} such that
P(X < t)=F_X(t-) \le p \quad \text { and } \quad P(X\le t) =F_X(t) \ge p
Remark: Various cases are possible:
A value a is the unique p-quantile for a single p if F_X is strictly increasing at a.
A value b can be the unique q-quantile simultaneously for a whole interval of values q (if F_X jumps at b).
The values of a whole interval [c,d] are all r-quantiles for the same r (if F_X is locally constant at level r).
We now make a choice to make it unique (other conventions occur in the literature).
Definition 5.8 (Quantile function for a random variable X) For a RV X with CDF F(t) we define the quantile function of X, Q:[0,1] \to \mathbb{R} as
Q(p) = \min\{ t\,:\, F(t) \ge p \}
with the convention that \inf \emptyset = + \infty
Remark:
Q is well defined since F being increasing and right-continuous implies that
\{t\,:\,F(t)\ge p\}=[a,\infty)
and thus the minimum exists.
Q(p) is a p-quantile since if s=Q(p) then F(s)\ge p and, for any t< s, F(t)<p. Therefore F(s-)\le p. In fact this shows that Q(p) is the smallest p-quantile of X.
If we had picked \widetilde{Q}(p)=\inf \{ t\,:\, F(t) > p \} this would have given us the largest p-quantile (a fine choice as well).
Theorem 5.3 (Properties of the quantile function) The quantile function Q(p) satisfies the following properties
Q(p) is increasing.
Q(F(t)) \le t.
F(Q(p)) \ge p.
Q(p-)=Q(p) and Q(p+) exists. That is Q is left continuous.
Proof.
If p \le q then F increasing implies that \{t\,:\, F(t) \ge q \} \subset \{t\,:\, F(t) \ge p \} and this implies that Q(q)\ge Q(p).
By definition Q(F(t)) is the smallest s such that F(s)\ge F(t). Thus Q(F(t))\le t.
Q(p) is a value s such that F(s)\ge p and thus F(Q(p))\ge p.
For item 4., since Q is increasing the left limit s=Q(p-)=\lim_{q \nearrow p} Q(q) exists and s \le Q(p). For any q < p we have F(s) \ge F(Q(q)) \ge q, hence F(s)\ge p and thus Q(p) \le s. Therefore Q(p-)=Q(p); the right limit Q(p+) exists by monotonicity. \quad \square
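To make the definitions concrete, here is a small Python sketch computing Q(p)=\min\{t: F(t)\ge p\} for a made-up discrete random variable and checking items 2. and 3. of Theorem 5.3:

```python
import numpy as np

# A made-up discrete RV taking values (1, 2, 5) with probabilities (0.2, 0.5, 0.3).
values = np.array([1.0, 2.0, 5.0])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)                 # F at the atoms: 0.2, 0.7, 1.0

def F(t):
    """CDF: largest cumulative probability among atoms <= t (0 if none)."""
    return cdf[values <= t].max() if np.any(values <= t) else 0.0

def Q(p):
    """Quantile Q(p) = min{t : F(t) >= p}: smallest atom whose CDF reaches p."""
    return values[np.searchsorted(cdf, p)]

assert Q(0.2) == 1.0 and Q(0.21) == 2.0 and Q(0.7) == 2.0 and Q(0.9) == 5.0
# Theorem 5.3, items 2. and 3.: Q(F(t)) <= t and F(Q(p)) >= p
for t in [1.0, 2.0, 3.0]:
    assert Q(F(t)) <= t
for p in [0.1, 0.5, 0.7, 0.95]:
    assert F(Q(p)) >= p
```

Note that Q jumps exactly over the flat stretches of F, and is constant exactly where F jumps, as in the remark above.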
5.6 Application: Construction of probability measures on \mathbb{R}
The most important property of the quantile function is the following, which shows that Q is a form of functional inverse for F.
Theorem 5.4 We have
Q(p) \le t \iff p \le F(t)
Proof.
If Q(p) \le t then since F is increasing F(Q(p)) \le F(t). But by Theorem 5.3, item 3., F(Q(p))\ge p and thus p \le F(t).
Conversely if p \le F(t) then, since Q is increasing, Q(p) \le Q(F(t)) \le t where the last inequality is from Theorem 5.3, item 2.
\quad \square.
We turn next to constructing all probabilities on \mathbb{R}. To do this we first need to construct at least one.
Theorem 5.5 (Lebesgue measure on [0,1]) There exists a unique probability measure P_0 on [0,1] with its Borel \sigma-algebra such that
P_0([a,b])=b-a \quad \text{ for all } 0 \le a \le b \le 1
The measure P_0 is the distribution of the uniform random variable on [0,1] with PDF
f(x)= \left\{
\begin{array}{cl}
1 & 0 \le x \le 1 \\
0 & \text{otherwise}
\end{array}
\right.
and CDF
F(x) =\left\{
\begin{array}{cl}
0 & x\le 0 \\
x & 0 \le x \le 1 \\
1 & x \ge 1
\end{array}
\right.
Proof. Go and take Math 623….
Equipped with this we can now prove
Theorem 5.6 Any probability measure P on \mathbb{R} has the form
P= P_0 \circ Q^{-1}
where P_0 is the Lebesgue measure on [0,1] and Q is the quantile function for F(t)=P((-\infty,t]).
Proof. By definition of the image measure (see Theorem 5.1), P_0 \circ Q^{-1} is a probability measure, and using Theorem 5.4 together with the fact that P_0( [0,a])=a we get
\begin{aligned}
P_0 \circ Q^{-1}( (-\infty, t]) & = P_0( \{p\,:\, Q(p) \le t\} ) \\
& = P_0( \{p\,:\, p \le F(t) \}) \\
& = F(t)
\end{aligned}
and we are done since the CDF determines the measure P uniquely. \quad \square
Another way to interpret this result is that we have constructed a probability space for any RV with a given CDF. Namely we constructed a probability space (\Omega,\mathcal{A},P)=([0,1],\mathcal{B},P_0), where P_0 is the Lebesgue measure on [0,1], and a map X=Q (the quantile function) with Q:[0,1] \to \mathbb{R}.
5.7 Simulation
Computers have built-in random number generators which generate a uniform RV on [0,1], that is, a RV whose distribution is P_0.
Inverse method to generate Random Variables:
To generate a RV X with CDF F(t):
Generate a random number U, uniform on [0,1].
Set X=Q(U) where Q is the quantile function for X.
Example:
If X has an exponential distribution with parameter \lambda, then F(t)= \int_0^t \lambda e^{-\lambda s} ds = 1-e^{-\lambda t} and Q(p)= -\frac{1}{\lambda} \ln(1 -p).
If X is uniform on \{1,2,\cdots,n\} then the quantile function is Q(p)=\lceil np\rceil. Recall \lceil x\rceil is the smallest integer greater than or equal to x.
If X is a normal RV then the CDF is F(t)=\int_{-\infty}^t \frac{e^{-x^2/2}}{\sqrt{2\pi}} dx. The quantile Q=F^{-1} has no closed form, but there exist excellent numerical routines to compute it. This can be used to generate normal random variables.
The inverse method has its limitations and we will learn other simulation methods later on.
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import ndtri  # quantile for the normal RV

uniform = np.random.rand(1000000)  # generate random numbers
dataexponential = -np.log(1 - uniform)  # quantile for the exponential
datanormal = ndtri(uniform)  # quantile for the normal
datadiscreteuniform10 = np.ceil(10 * uniform)  # quantile for uniform on {1,...,10}

# Create a histogram
hist, bin_edges = np.histogram(datadiscreteuniform10, bins=100, density=True)  # Adjust the number of bins as needed

# Calculate the PDF from the histogram
bin_width = bin_edges[1] - bin_edges[0]
pdf = hist * bin_width

# Plot the empirical PDF
plt.bar(bin_edges[:-1], pdf, width=bin_width, alpha=0.5)
plt.xlabel('X-axis')
plt.ylabel('PDF')
plt.title('Empirical Probability Density Function')
plt.show()
5.8 Exercises
Exercise 5.1
Suppose Y is a real-valued random variable with a continuous cdf F_Y(t) and probability distribution P_Y on (\mathbb{R},\mathcal{B}). Show that the random variable U=F_Y(Y) has a uniform distribution P_0 on [0,1] (i.e. Lebesgue measure).
In Theorem 5.6, using the quantile Q, given a CDF F we constructed a random variable X:([0,1],\mathcal{A},P_0) \to (\mathbb{R},\mathcal{B}) (P_0 is Lebesgue measure) whose CDF is F. In other words we showed Q(U) has CDF F.
Use this fact and part 1. to construct a random variable X':(\mathbb{R},\mathcal{B},P_Y) \to (\mathbb{R},\mathcal{B}) such that its CDF is F.
Exercise 5.2 Show that the function
X(\omega)= \left\lceil \frac{\ln(1-\omega)}{\ln(1-p)} \right\rceil
defines a geometric random variable with success probability p on the probability space (\Omega, \mathcal{A}, P_0) (where P_0 is Lebesgue measure). In other words, if U is uniform on [0,1] then \left\lceil \frac{\ln(1-U)}{\ln(1-p)} \right\rceil has a geometric distribution, which provides an easy way to generate geometric random variables on a computer. Hint: What are the CDFs of exponential and geometric random variables?
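A quick empirical check of this formula (a sketch; the seed, sample size, and value of p are arbitrary choices):

```python
import numpy as np

# Inverse method for the geometric distribution: X = ceil(ln(1-U)/ln(1-p))
# with U uniform on [0,1) should satisfy P(X = k) = (1-p)^(k-1) * p.
rng = np.random.default_rng(0)
p = 0.3
U = rng.random(200_000)
X = np.ceil(np.log(1 - U) / np.log(1 - p))

for k in [1, 2, 3]:
    empirical = np.mean(X == k)
    exact = (1 - p) ** (k - 1) * p
    print(k, empirical, exact)
    assert abs(empirical - exact) < 0.01
```

With 200,000 samples the empirical frequencies match the geometric probabilities to well within the Monte Carlo error.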
Exercise 5.3 Some notations:
A probability measure P on the measurable space (\Omega, \mathcal{A}) is called diffuse if P has no atoms.
Two probability measures P and Q on (\Omega, \mathcal{A}) are called singular if we can partition \Omega=\Omega_P\cup \Omega_Q (with \Omega_P\cap \Omega_Q =\emptyset) such that P(\Omega_P)=1 and Q(\Omega_Q)=1.
The set of all probability measures on (\Omega, \mathcal{A}) is denoted by \mathcal{P}(\Omega). It is a convex set: if P,Q \in \mathcal{P}(\Omega) then R=\alpha P + (1-\alpha)Q \in \mathcal{P}(\Omega) for any \alpha \in [0,1]. We say then that R is a mixture of P and Q.
Show the following
Show that any probability measure can be decomposed as a mixture of an atomic measure and a diffuse measure.
Suppose P is a probability measure on (\mathbb{R},\mathcal{B}) with CDF F(t). Describe the decomposition of the measure P into an atomic and diffuse measure in terms of the CDF F.
Suppose P is a diffuse measure on (\mathbb{R},\mathcal{B}) and A \subset \mathbb{R} is any subset with P(A)>0. Show that for any 0\le t \le 1 there exists a set B_t \subset A such that P(B_t)=tP(A). Hint: Let B_t=A \cap(-\infty,t]. Study the function h(t)=P(B_t).
Exercise 5.4
Prove that the Cantor function (a.k.a devil’s staircase) given in Section 5.4 is continuous and that this defines a diffuse probability measure P.
Let C be the Cantor set obtained by removing from [0,1] the intervals (1/3,2/3), (1/9, 2/9), (7/9,8/9), and so on. If P_0 is the Lebesgue measure on [0,1], show that P_0(C)=0 and that yet C has the same cardinality as [0,1]. Hint: One option is to use the Cantor function.
Show that the Lebesgue measure on [0,1], the Cantor measure, and any atomic measure are singular.
Exercise 5.5 In this problem I would like you to write at least pseudo-code (better: write actual code, run it, for example by modifying the code in Section 5.7, and visualize your result). We suppose the quantile function of the normal random variable with parameters (\mu,\sigma)=(0,1) is known. For example in python
from scipy.special import ndtri # quantile for the normal RV
Calling random numbers (as many as needed) and using the quantile function ndtri, write code which generates a mixture of 3 normal random variables with parameters
( \mu_1, \sigma_1) = (-2,.4),\quad (\mu_2, \sigma_2)=(0,.3),\quad (\mu_3, \sigma_3)=(3,1)
with mixing parameters (2/7,4/7,1/7).
6 Integration with respect to a probability measure
Given a probability space (\Omega, \mathcal{A}, P) and a random variable X: \Omega \to \mathbb{R} how do we define the expectation of X for general random variables?
There are two parts to the theory: a general theory using the measure P, from which we deduce a more practical approach using the probability P^X on \mathbb{R} (the only thing we really know how to handle …).
6.1 Definition of the expectation (a.k.a the integral)
We start by giving a definition of the expectation for an arbitrary random variable. The definition is a bit rigid and may seem at first sight slightly arbitrary, but subsequent analysis will show that this is a good choice.
Definition 6.1 (Definition of expectation) Let (\Omega, \mathcal{A},P) be a probability space.
Suppose X is a simple RV (i.e., it takes finitely many values) then X=\sum_{j=1}^M b_j 1_{B_j} (in canonical form!). We define
E[X]= \sum_{j=1}^M b_j P(B_j)
\tag{6.1}
Suppose X is an arbitrary non-negative RV (i.e. X(\omega)\ge 0 for all \omega \in \Omega.) Then using the functions d_n given in Theorem 4.4 consider the simple RV X_n=d_n \circ X and define
E[X]= \lim_{n \to \infty} E[X_n] \quad \text{where the limit is allowed to be } +\infty
\tag{6.2}
For an arbitrary RV X, write X=X_+-X_- and define
E[X]=
\left\{
\begin{array}{cl}
E[X_+] - E[X_-] & \text{if } E[X_+] < \infty \text{ or } E[X_-] < \infty\\
\text{undefined} & \text{if } E[X_+] = \infty \text{ and } E[X_-] = \infty
\end{array}
\right.
\tag{6.3}
Remarks Let us make a number of comments on the definition.
If the simple RV is not in canonical form, i.e. X=\sum_{i=1}^N a_i 1_{A_i}, then still E[X]=\sum_{i=1}^N a_i P(A_i).
The preceding remark implies that if X and Y are simple random variables then E[X +Y]=E[X]+ E[Y] (linearity of expectation).
If X and Y are simple and nonnegative and X \le Y then E[X]\le E[Y]. This follows from linearity by writing Y= X + (Y-X), so that E[Y] =E[X] + E[Y-X]. Since Y-X \ge 0 we have E[Y-X]\ge 0 and so E[X]\le E[Y].
The functions d_n are increasing in n, d_n(x) \le d_{n+1}(x), and this implies that X_n \le X_{n+1} and thus E[X_n] \le E[X_{n+1}].
Therefore the limit in Equation 6.2 always exists but could well be equal to +\infty.
The definition in item 2. seems somewhat arbitrary since it uses a particular choice of simple functions d_n. We will show soon that this choice actually does not matter.
For general X we allow the expectation to be equal to +\infty (if E[X_+]=\infty and E[X_-]<\infty) or -\infty (if E[X_+]<\infty and E[X_-]=\infty). If both E[X_+]=\infty and E[X_-]=\infty the expectation is undefined.
If X:\Omega \to \overline{\mathbb{R}} is extended real-valued we can still define the expectation in the same way.
Definition 6.2 A random variable X is integrable if E[|X|] < \infty, or equivalently if both E[X_+]<\infty and E[X_-]<\infty.
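To illustrate Definition 6.1, the following Python sketch approximates E[X] by E[d_n \circ X] for a uniform random variable on [0,1], discretizing \Omega=[0,1] on a fine grid as a numerical stand-in for Lebesgue measure (this discretization is our own device, not part of the text):

```python
import numpy as np

# (Omega, P) = ([0,1], Lebesgue), approximated by a fine uniform grid,
# and X(omega) = omega, so X is uniform on [0,1] with E[X] = 1/2.
omega = np.linspace(0, 1, 1_000_001)
X = omega

def expectation_of_simple(n):
    """E[X_n] for the simple RV X_n = d_n(X), d_n the dyadic approximation."""
    Xn = np.minimum(np.floor(X * 2**n) / 2**n, n)
    return np.mean(Xn)                 # grid average approximates E[X_n]

approx = [expectation_of_simple(n) for n in (2, 5, 10)]
print(approx)                          # increases toward E[X] = 0.5
assert approx[0] <= approx[1] <= approx[2] <= 0.5 + 1e-9
assert abs(approx[2] - 0.5) < 1e-3
```

As in the remarks above, the approximations E[X_n] increase with n (here toward 1/2), with error of order 2^{-n}.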
6.2 Monotone Convergence
We extend monotonicity to general non-negative RVs.
Theorem 6.1 (Monotonicity) If X \ge 0 then E[X] \ge 0. If 0 \le X \le Y then E[X]\le E[Y].
Proof. If X \ge 0 then so is X_n=d_n\circ X, hence E[X_n]\ge 0 and therefore E[X]\ge 0. If 0 \le X \le Y then X_n = d_n \circ X \le d_n \circ Y = Y_n and so E[X_n]\le E[Y_n], and thus E[X]\le E[Y]. \quad \square
The next theorem (Monotone convergence Theorem) is very useful in itself and, in addition, the other convergence theorems for expectations derive from it.
Theorem 6.2 (Monotone Convergence Theorem) Suppose X_n are non-negative and increasing: 0\le X_n(\omega) \le X_{n+1}(\omega). Then X(\omega)=\lim_{n \to \infty}X_n(\omega) exists and
\lim_{n \to \infty} E[X_n] = E[X]= E[\lim_{n\to \infty} X_n ]
Proof. Since X_n(\omega) is an increasing sequence, the limit X(\omega) \in \overline{\mathbb{R}} exists and E[X] is well defined. By monotonicity, see Theorem 6.1, we have X_n \le X_{n+1} \le X and therefore \lim_{n \to \infty} E[X_n] exists and we have
\lim_{n \to \infty} E[X_n] \le E[X]\,.
We need to show the reverse inequality: \lim_{n \to \infty} E[X_n] \ge E[X]. To prove this we need to show the following claim.
Claim: Suppose Y is simple and Y \le X then \lim_{n \to \infty} E[X_n] \ge E[Y].
Indeed if the claim is true \lim_{n \to \infty} E[X_n] \ge E[d_k \circ X ] for all k and taking the limit k \to \infty concludes the proof.
To prove the claim take b \ge 0 and consider the set B=\{ X > b\} and set B_n =\{ X_n >b \}. Since B_n \nearrow B we have P(B_n) \to P(B) by sequential continuity. Furthermore
X_n 1_{B} \ge X_n 1_{B_n} \ge b 1_{B_n}
which implies, by monotonicity, that E[X_n 1_{B}] \ge b P(B_n) and taking n \to \infty we obtain
\lim_{n \to \infty} E[X_n 1_B]\ge b P(B).
\tag{6.4} Now this inequality remains true if we consider the set \overline{B}=\{ X \ge b\} instead of B. To see this, take an increasing sequence b_m \nearrow b so that \{X >b_m\} \nearrow \{X \ge b\}: apply Equation 6.4 (with b replaced by b_m) and then use monotonicity.
To conclude note that if Y=\sum_{i=1}^m a_i 1_{A_i} (in canonical form) and X\ge Y then X\ge a_i on A_i. By finite additivity, using Equation 6.4, we have
\lim_{n \to \infty} E[X_n] = \sum_{i=1}^m \lim_{n \to \infty} E[X_n 1_{A_i}] \ge \sum_{i=1}^m a_i P(A_i) = E[Y]
and this concludes the proof. \quad \square
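The mechanics of this construction are easy to see numerically. The sketch below (an illustration only: a midpoint grid on [0,1] stands in for the Lebesgue measure, and X(\omega)=\omega^{-1/2} is an arbitrary unbounded example with E[X]=2) shows the expectations E[d_n \circ X] increasing towards E[X].

```python
import numpy as np

def d_n(x, n):
    # dyadic simple-function approximation: round down to multiples of 2^-n, cap at n
    return np.minimum(np.floor((2.0 ** n) * x) / 2.0 ** n, n)

N = 1_000_000
omega = (np.arange(N) + 0.5) / N     # midpoint grid standing in for Lebesgue measure on [0,1]
X = omega ** -0.5                    # X(w) = 1/sqrt(w), so E[X] = int_0^1 w^(-1/2) dw = 2

expectations = [d_n(X, n).mean() for n in range(1, 15)]
# monotone increase toward E[X] = 2 (convergence is slow since X is unbounded)
assert all(a <= b for a, b in zip(expectations, expectations[1:]))
print(expectations[0], expectations[-1])
```

The sequence is exactly increasing (no floating-point violations, since doubling the dyadic grid can only refine the floor), mirroring E[X_n] \le E[X_{n+1}] in the proof.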
6.3 Further properties of the expectation
Remark: The monotone convergence theorem shows that if X_n is any sequence of simple functions increasing to X then E[X]=\lim_n E[X_n].
Theorem 6.3 (Linearity of Expectation) If X and Y are integrable random variables then for any a, b \in \mathbb{R} we have
E[a X + bY] = a E[X] + b E[Y]
Proof. In a first step assume that a,b,X,Y are all nonnegative. If X and Y are simple this is true by the remarks after Definition 6.1. For general X and Y pick X_n and Y_n simple functions which increase to X and Y respectively (e.g. X_n=d_n\circ X and Y_n=d_n\circ Y). Then
E[a X_n+bY_n]=a E[X_n]+ bE[Y_n].
Now by the Monotone Convergence Theorem aX_n + bY_n increases to aX+bY and thus taking n \to \infty proves the nonnegative case. For general integrable X and Y and a,b \in \mathbb{R}, write aX+bY as a difference of nonnegative parts and apply the first step together with subtraction. \quad \square
6.4 Negligible sets
Definition 6.3
A measurable set A \in \mathcal{A} is negligible with respect to P (or a null set for P) if P(A)=0.
A set A (not necessarily measurable) is negligible with respect to P if there exists B \in \mathcal{A} such that A \subset B and P(B)=0 (i.e. A is a subset of a set of measure 0).
It is a fine point of measure theory that negligible sets need not be measurable. This is true for example for the Borel \sigma-algebra. There is a standard procedure (called the completion of a probability space) which extends the \sigma-algebra and the probability measure P such that all negligible sets are measurable. The idea is to define, with \mathcal{N} denoting all the null sets of \mathcal{A}, a new \sigma-algebra
\overline{\mathcal{A}} =\{ A\cup N\,:\, A \in \mathcal{A}, N\in \mathcal{N} \}
and a new probability measure
\overline{P}(A \cup N) = P(A)\,.
The probability space (\Omega, \overline{\mathcal{A}}, \overline{P}) is called the completion of (\Omega, \mathcal{A}, P).
For example the completion of the Borel \sigma-algebra on [0,1] with the Lebesgue measure is called the Lebesgue \sigma-algebra. This does not play much of a role, but at a few occasions it will be convenient to assume that the space is complete.
Almost sure properties:
We say that two RVs X and Y are equal almost surely if
P(X=Y)= P(\{\omega\,:\, X(\omega)=Y(\omega)\})=1
that is X and Y differ on a negligible set. We write X=Y a.s.
If X=Y almost surely then E[X]=E[Y]. Indeed the simple approximations satisfy X_n=Y_n almost surely, and two simple functions which are equal almost surely have the same expectation (use the canonical form).
We say, for example, that X \ge Y a.s. if P(\{\omega\,:\, X(\omega)\ge Y(\omega)\})=1.
We say X_n converges to X almost surely if there exists a negligible sets N such that for all \omega \in \Omega \setminus N we have \lim_{n}X_n(\omega)=X(\omega).
Theorem 6.4 Suppose X\ge 0. Then E[X]=0 if and only if X=0 a.s.
Proof. If X=0 a.s. then E[X]=0 because E[0]=0. Conversely suppose E[X]=0 and let A_n=\left\{\omega : X(\omega)\ge \frac{1}{n} \right\}. Then X \ge X 1_{A_n} \ge \frac{1}{n} 1_{A_n} and thus by monotonicity
0=E[X]\ge E[X 1_{A_n}]\ge \frac{1}{n}P(A_n)
and thus P(A_n)=0 for all n. But A_n \nearrow \{X > 0\} and thus by sequential continuity P(X > 0)=0, that is P(X = 0)=1. \quad \square
6.5 Fatou’s Lemma
Our first convergence theorem was the monotone convergence theorem Theorem 6.2. Our second convergence theorem still deals with non-negative random variables and is called Fatou’s lemma.
Theorem 6.5 (Fatou’s Lemma) Suppose X_n are non-negative random variables. Then
E [\liminf_{n} X_n] \le \liminf_{n} E[X_n]
Proof. Set Y_n= \inf_{m\ge n} X_m. Then Y_n \le Y_{n+1} and \liminf_{n} X_n = \lim_{n} \inf_{m \ge n} X_m = \lim_{n} Y_n. We can use the monotone convergence theorem for the sequence Y_n to get
E[\liminf_{n} X_n] = \lim_{n} E[Y_n]\,.
\tag{6.5} Also for m \ge n we have Y_n =\inf_{k\ge n} X_k \le X_m and so by monotonicity E[Y_n] \le E[X_m] and thus
E[Y_n] \le \inf_{m \ge n}E[X_m]\,.
\tag{6.6} Combining Equation 6.5 and Equation 6.6 we find
E[\liminf_{n} X_n] \le \lim_{n} \inf_{m \ge n}E[X_m] = \liminf_{ n} E[X_n]
\tag{6.7}\quad \square
Variation on Fatou’s Lemma: One can deduce directly from Fatou’s Lemma the following results
If X_n \ge Y and Y is an integrable RV then E[\liminf_{n} X_n] \le \liminf_{n} E[X_n]. Proof: Apply Fatou’s Lemma to the RV Y_n=X_n-Y which is nonnegative.
If X_n \le Y and Y is an integrable RV E[\limsup_{n} X_n] \ge \limsup_{n} E[X_n]. Proof: Apply Fatou’s Lemma to the RV Y_n=Y-X_n which is nonnegative.
We shall use these versions of Fatou’s Lemma to prove our next big result, the Dominated Convergence Theorem.
Fatou’s Lemma tells us that in the limit you can “lose probability” but never gain it. For example if X_n \to 0 then \liminf_n E[X_n] \ge 0.
For example if we take \Omega=[0,1] and P the Lebesgue measure. Let
X_n(\omega)= n 1_{[0,\frac{1}{n}]}(\omega)
Then E[X_n]=nP([0,\frac{1}{n}])=1 for all n. But X_n \to 0 a.s. and so E[\lim_n X_n] =0 \not =1.
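This escaping-mass example is easy to reproduce numerically; the sketch below is an illustration only, with a midpoint grid on [0,1] standing in for the Lebesgue measure.

```python
import numpy as np

# X_n = n * 1_{[0,1/n]} on [0,1]: E[X_n] = 1 for every n, yet X_n(w) -> 0
# for every w > 0, so E[lim X_n] = 0 < 1 = lim E[X_n] (consistent with Fatou).
N = 1_000_000
omega = (np.arange(N) + 0.5) / N      # midpoint grid for [0,1]

def X(n, w):
    return n * (w <= 1.0 / n)

means = [X(n, omega).mean() for n in (1, 10, 100, 1000)]
pointwise_limit = np.zeros_like(omega)   # the a.s. limit of X_n is 0
print(means, pointwise_limit.mean())     # means stay at 1, the limit has mean 0
```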
6.6 Dominated convergence Theorem
Theorem 6.6 (Dominated convergence theorem) Suppose \{X_n\} is a collection of random variables such that
\lim_{n}X_n(\omega) =X(\omega) for all \omega
There exists an integrable random variable Y such that |X_n|\le Y for all n. Then
\lim_{n}E[X_n] = E[X] =E[\lim_{n}X_n]
Proof. We derive it from Fatou’s Lemma. The condition |X_n|\le Y means that -Y \le X_n \le Y.
Applying Fatou’s lemma to Y-X_n \ge 0 we find that
E[ \liminf_{n} (Y-X_n)] \le \liminf E[Y - X_n]
Using that \liminf_n (-a_n) = - \limsup_n a_n and \lim_n{X_n}=X we find
E[ \liminf_{n} (Y-X_n)] = E[Y] + E[ \liminf_{n} (-X_n)] = E[Y] - E[ \limsup_{n} X_n] =E[Y]-E[X]
and \liminf E[Y - X_n] = E[Y] - \limsup_{n} E[X_n] and thus we have \limsup_{n} E[X_n] \le E[X]. Applying Fatou’s to X_n+Y\ge 0 yields in a similar manner E[X] \le \liminf_{n} E[X_n] (check this). Therefore we have \limsup_{n} E[X_n] \le E[X] \le \liminf_{n}E[X_n]. This proves that \lim_{n} E[X_n]=E[X]. \quad \square.
A special case of the dominated convergence theorem is frequently used
Theorem 6.7 (Bounded convergence theorem) Suppose \{X_n\} is a collection of random variables such that
\lim_{n}X_n(\omega) =X(\omega) for all \omega
There exists a constant c such that |X_n|\le c for all n. Then
\lim_{n}E[X_n] = E[X] =E[\lim_{n}X_n]
Proof. Y=c is integrable so the result follows from the dominated convergence theorem. \quad\square.
Remark on almost sure versions:
The monotone convergence theorem, Fatou’s lemma, and the dominated convergence theorem also have almost sure versions.
For example if Y is integrable and |X_n| \le Y almost surely and X_n(\omega)\to X almost surely then \lim_{n} E[X_n]=E[X]. To see this define
N=\{\omega \,:\, |X_n(\omega)|\le Y(\omega) \text{ for all } n \} \quad \text{ and } \quad M= \{\omega\,:\,\lim_{n} X_n(\omega)= X(\omega) \}
Then P(M^c)=P(N^c)=0. We can modify the RVs on sets of measure 0 in such a way that the statements hold for all \omega: set X_n=0, X=0, Y=0 on M^c \cup N^c. Then the properties hold for all \omega and since the expectations do not change we are done.
6.7 The Expectation rule (very useful for computations)
Computing the expectation of RV X (or h(X)) can be done using either P (good for proofs) or P^X (good for computations). As we will see this is an abstract version of the change of variable formula from Calculus!
Notation Another widely used (and convenient) notation for the expectation is
E[X] = \int_\Omega X(\omega) dP(\omega)
Theorem 6.8 (Expectation rule) Suppose X is a RV on (\Omega, \mathcal{A}, P) taking value in (F, \mathcal{F}) and with distribution P^X. Let h:(F, \mathcal{F}) \to (\mathbb{R},\mathcal{B}) be measurable.
h(X) \in \mathcal{L}^1(\Omega, \mathcal{A}, P) if and only if h \in \mathcal{L}^1(F, \mathcal{F},P^X).
If either h \ge 0 or h satisfies the equivalent conditions in 1. we have
E[h(X)]= \int_\Omega h(X(\omega)) dP(\omega) = \int_{F} h(x) d P^X(x)
\tag{6.8}
Conversely suppose Q is a probability measure on (F,\mathcal{F}) such that
E[h(X)]= \int_{F} h(x) dQ(x)
for all non-negative measurable h. Then Q=P^X.
Proof. The probability distribution of X, P^X, is defined by P^X(B) = P(X^{-1}(B)). Therefore
E[1_B(X)]= P(X \in B) = P^X(B) = \int_F 1_B(x) dP^X(x)
This proves Equation 6.8 in the case of a characteristic function. By linearity, if h is a simple function then Equation 6.8 holds.
If h:F \to \mathbb{R} is non-negative then pick a sequence of simple functions h_n such that h_n \nearrow h. Then
\begin{aligned}
E[h(X)] = E[\lim_{n \to \infty} h_n(X)] & = \lim_{n \to \infty} E[h_n(X)] \quad \text{ by the MCT in } \Omega \\
& = \lim_{n \to \infty} \int_F h_n(x) dP^X(x) \quad \text{ because } h_n \text{ is simple}. \\
& = \int_F \lim_{n \to \infty} h_n(x) dP^X(x) \quad \text{ by MCT in } F \\
& = \int_F h(x) dP^X(x)
\end{aligned}
This proves Equation 6.8 for h non-negative. If we apply this to |h| this proves part 1. of the Theorem. For general h, write h=h_+-h_- and deduce the result by subtraction.
For the converse in item 3. just take h=1_A to be a characteristic function. Then
P(X \in A)=E[1_A(X)]= \int_F 1_A(x) dQ(x)=Q(A)\,.
Since A is arbitrary, the distribution of X is Q. \quad \square
Consequences:
If X is a real-valued random variable, we can compute its expectation as an integral over \mathbb{R}.
If X is real-valued and h: \mathbb{R} \to \mathbb{R} is a (measurable) function (e.g. h(x)=x^{n}, or h(x)=e^{i\alpha x}, \cdots) then we have
E[X]= \int_{\mathbb{R}} x dP^X(x)\, \quad E[X^n]= \int_{\mathbb{R}} x^n dP^X(x)\, \quad \cdots
An alternative would be to compute the distribution P^Y of Y=X^n and then we have
E[X^n] = E[Y] = \int_{\mathbb{R}} y dP^Y(y)
Generally we will compute E[h(X)] using the distribution of X ….
But often we will work backward. We will use the change of variable formula to compute the distribution of Y (see item 3. in Theorem 6.8). Checking the equality for all non-negative functions or all characteristic functions is not always easy so we will show that one can restrict oneself to just nice functions! (Later..)
Examples forthcoming
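As a first numerical illustration of the expectation rule (a sketch; the standard normal X and the choice h(x)=x^2 are arbitrary), the two sides of Equation 6.8 can be computed separately: averaging h(X(\omega)) over samples (the \Omega side) and integrating h against the density of P^X (the \mathbb{R} side).

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.standard_normal(2_000_000)       # "Omega side": values X(w)
omega_side = (samples ** 2).mean()             # sample average of h(X)

x = np.linspace(-10.0, 10.0, 400_001)          # "R side": integrate h dP^X
dx = x[1] - x[0]
density = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
r_side = np.sum(x ** 2 * density) * dx         # Riemann sum of x^2 against the density

print(omega_side, r_side)                      # both approximate E[X^2] = 1
```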
6.8 Applications of the expectations rule.
We investigate how the pdf of a random variable transforms under a linear transformation X \mapsto Y= aX+b.
Theorem 6.9 Suppose the real-valued RV X has the pdf f_X(t); then for a \ne 0, Y=aX+b has the pdf f_Y(y)=\frac{1}{|a|}f_X(\frac{y-b}{a})
Proof. For a change we prove it directly using the expectation rule. The pdf f_Y(y) must satisfy, for any nonnegative h,
E[h(Y)] = \int h(y) f_Y(y) dy
We rewrite this in terms of the pdf of X, using the expectation rule again and the change of variable y=ax+b:
E[h(Y)] = E[h(aX+b)] = \int_{-\infty}^\infty h(ax+b) f_X(x) dx = \int_{-\infty}^\infty h(y)f_X\left(\frac{y-b}{a}\right) \frac{1}{|a|} dy
and therefore we must have f_Y(y)= \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right). \quad \square
Remark Alternatively you can prove this using the CDF, for example for a>0
F_Y(t)=P (Y \le t) = P(aX+b \le t) = P\left(X\le \frac{t-b}{a}\right)= F_X\left(\frac{t-b}{a}\right)
and then differentiate. Do the case a<0.
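A quick numerical check of Theorem 6.9 (the parameters a=-2, b=3 and the standard normal X below are arbitrary choices): a histogram of samples of Y=aX+b should match \frac{1}{|a|}f_X(\frac{y-b}{a}).

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = -2.0, 3.0
X = rng.standard_normal(1_000_000)
Y = a * X + b

# empirical density of Y on a grid of bins
hist, edges = np.histogram(Y, bins=200, range=(-7.0, 13.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def f_X(t):
    # standard normal pdf
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

predicted = f_X((centers - b) / a) / abs(a)    # the pdf from Theorem 6.9
print(np.max(np.abs(hist - predicted)))        # small histogram error
```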
Location-scale family of random variables: A family of random variables parametrized by \alpha \in \mathbb{R} (=location) and \beta \in (0,\infty) (=scale) is called a location-scale family if X belonging to the family implies that Y=aX+b also belongs to the family for any a>0 and b \in \mathbb{R}. If X has a density this is equivalent to requiring that the densities have the form
f_{\alpha,\beta}(x) = \frac{1}{\beta}f\left(\frac{x-\alpha}{\beta}\right)
for some fixed function f(x).
Normal RVs form a location-scale family with parameters \mu (=location) and \sigma (=scale) and f(x)= \frac{e^{-x^2/2}}{\sqrt{2 \pi}}.
The Cauchy distribution with pdf \frac{1}{\beta \pi} \frac{1}{1 + (\frac{x-\alpha}{\beta})^2} is also a location scale family.
Some families of distributions form only a scale family. For example the exponential random variables with density f(x) = \frac{1}{\beta} e^{-x/\beta} form a scale family with scale parameter \beta.
6.9 Exercises
Exercise 6.1 Show the following facts:
Show that X=0 almost surely if and only if E[X 1_A]=0 for all measurable sets A.
Suppose X is a random variable with E[X] < \infty. Show that X < \infty almost surely. Hint: Consider the set B_n=\{X \ge n\}.
Exercise 6.2 (sum of random variables) Suppose X_n is a collection of random variables defined on the probability space (\Omega,\mathcal{A},P).
Prove that if the X_n are all nonnegative then E[\sum_{k=1}^\infty X_k] = \sum_{k=1}^\infty E[X_k]. Hint: Use the monotone convergence theorem.
Prove that if \sum_{k=1}^\infty E[|X_k|] is finite then E[\sum_{k=1}^\infty X_k] = \sum_{k=1}^\infty E[X_k]. Hint: Consider the RV Y=\sum_k |X_k| and use the dominated convergence theorem and Exercise 6.1, part 2.
Exercise 6.3 (Building new probability measures using densities) Suppose Y is a random variable on the probability space (\Omega, \mathcal{A},P) with Y \ge 0 almost surely and E[Y]=1.
Define Q: \mathcal{A} \to \mathbb{R} by Q(A) = E[Y 1_A]. Show that Q is a probability measure on (\Omega, \mathcal{A}). We denote by E_Q the expectation with respect to Q.
Show, using the definition of the integral, that E_Q[X] = E[XY].
Show if B\in \mathcal{A} is such that P(B)=0 then we have Q(B)=0. (We say then that Q is absolutely continuous with respect to P.)
Show that, in general Q(B)=0 does not imply P(B)=0 but that if Y > 0 almost surely then Q(B)=0 does imply P(B)=0.
Assuming Y>0 almost surely show that \frac{1}{Y} is integrable with respect to Q and show that the measure R defined by R(A) = E_Q[\frac{1}{Y}1_A] is equal to P.
Exercise 6.4 (the log normal distribution)
Suppose X is a normal random variable with mean \mu and variance \sigma^2. Show that the random variable Y=e^{X} has the following density
f(x) =
\left\{
\begin{array}{cl}
\frac{1}{x} \frac{1}{\sqrt{2 \pi} \sigma} e^{ - (\log(x)-\mu)^2/2\sigma^2} & x > 0 \\
0 & x \le 0
\end{array}
\right.
The random variable Y is said to have the log-normal distribution with parameters \mu and \sigma^2.
Show that E[Y^r]= e^{r \mu + \frac{1}{2}\sigma^2 r^2}. Hint: Do the change of variables y=\log(x)-\mu in the integral for E[Y^r].
Exercise 6.5 (Cauchy distribution)
Suppose X is a random variable with density f_X(x). Express the density f_Y of Y=\frac{a}{X} in terms of f_X.
A Cauchy RV with parameters (\alpha, \beta) has the pdf f(x)=\frac{1}{\beta \pi}\frac{1}{1+ (x-\alpha)^2/\beta^2}.
Show that if X is a Cauchy RV so is Y=aX+b and find how the parameters transform.
Show that if X has a Cauchy distribution with \alpha=0 then \frac{1}{X} has again a Cauchy distribution.
Show that the mean and the variance of a Cauchy RV are undefined.
Exercise 6.6 Consider the RV X with CDF given by
F(t)=
\left\{
\begin{array}{cl}
0 & t \le -1 \\
1/4 + \frac{1}{3} (t+1)^2 & -1 \le t < 0 \\
1 - \frac{1}{4}e^{-2t} & t \ge 0
\end{array}
\right.
Compute E[X] and {\rm Var}(X).
7 Inequalities
7.1 Jensen inequality
Some facts about convex functions: Recall that a function \phi on \mathbb{R}^d is convex if
\phi( \alpha x + (1-\alpha)y) \le \alpha \phi(x) + (1-\alpha) \phi(y)
for all x,y and all 0 \le \alpha \le 1. This means that the line segment between (x,\phi(x)) and (y,\phi(y)) lies above the graph of \phi(z) for z lying on the line segment between x and y.
An equivalent description of a convex function (“obvious” from a picture, with a proof in the homework) is that at any point x_0 we can find an affine function l(x)=\phi(x_0) + c\cdot(x-x_0) such that \phi(x) \ge l(x) for all x (the graph of \phi lies above the graph of l) and \phi(x_0)=l(x_0). If \phi is differentiable at x_0 the supporting plane is the tangent plane to the graph at x_0 and we have
\phi(x) \ge \phi(x_0) + \nabla \phi(x_0)\cdot(x-x_0)
If \phi is twice continuously differentiable then \phi is convex if and only if the matrix of second derivatives D^2\phi(x) is positive semidefinite for every x.
Theorem 7.1 (Jensen inequality) If \phi: \mathbb{R} \to \mathbb{R} is a convex function then
E[\phi(X)]\ge \phi(E[X])
provided both expectations exist, i.e. E[|X|]< \infty and E[|\phi(X)|] < \infty.
Proof. Choose x_0=E[X] and pick a supporting line l(x)=\phi(E[X]) + c(x-E[X]) at x_0 so that for any x
\phi(x) \ge \phi(E[X]) + c (x - E[X])
By the monotonicity of expectation we obtain
E[\phi(X)] \ge \phi(E[X]) + c\, E[X - E[X]] = \phi(E[X])\,. \quad \square
Examples
Since f(x)=x^2 is convex we have E[X]^2 \le E[X^2].
Since f(x)=e^{\alpha x} is convex for any \alpha \in \mathbb{R} we have E[e^{\alpha X}] \ge e^{\alpha E[X]}.
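Both examples are easy to confirm numerically; the sketch below uses an exponential sample (an arbitrary choice, with E[X]=1) and sample means in place of expectations.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=1_000_000)

mean = X.mean()
# phi(x) = x^2: E[X]^2 <= E[X^2]  (here roughly 1 <= 2)
assert mean ** 2 <= (X ** 2).mean()
# phi(x) = e^{alpha x}: e^{alpha E[X]} <= E[e^{alpha X}]  (here roughly 1.65 <= 2)
alpha = 0.5
assert np.exp(alpha * mean) <= np.exp(alpha * X).mean()
print(mean ** 2, (X ** 2).mean())
```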
Remark The theory of convex functions is very rich and immensely useful!
We will need the following slight generalization of Jensen inequality
Theorem 7.2 If \phi: \mathbb{R}^d \to \mathbb{R} is a convex function and X=(X_1, \cdots, X_d) is a RV taking values in \mathbb{R}^d, then we have
E[\phi(X)]\ge \phi((E[X_1], \cdots, E[X_d]))
provided both expectations exist.
Proof. Same proof as Jensen.
7.2 L^p-norms
Suppose (\Omega,\mathcal{A},P) is a probability space and X is a real-valued random variable.
Definition 7.1 (L^p-norms) Given a random variable X and 1 \le p \le \infty we define
\|X\|_p = E[|X|^p]^\frac{1}{p} \quad \text{ for } 1 \le p < \infty
and
\|X\|_\infty= \inf\{ b \in \mathbb{R}_+\,:\, |X|\le b \textrm{ a.s} \}
and \|X\|_p is called the L^p norm of a RV X.
Remarks
It is easy to check that
\|X\|_p=0 \implies X=0 \text{ almost surely},
\|cX\|_p = |c| \|X\|_p.
\|X\|_p <\infty means that |X|^p is integrable (if 1 \le p < \infty) and that X is almost surely bounded (if p=\infty). Often \|X\|_\infty is called the essential supremum of X.
7.3 Cauchy-Schwartz, Hölder, Minkowski
Theorem 7.3 (Hölder and Minkowski inequalities)
Hölder: Suppose 1\le p,q\le \infty are such that \frac{1}{p}+\frac{1}{q}=1 then we have
\|X Y\|_1 \le \|X\|_p \|Y\|_q\,.
A special case is the Cauchy-Schwartz inequality (p=q=2)
\|X Y\|_1 \le \|X\|_2 \|Y\|_2 \,.
Minkowski: For 1\le p \le \infty we have
\|X + Y\|_p \le \|X\|_p + \|Y\|_p \,.
(a.k.a. the triangle inequality)
Proof. The proof is ultimately a consequence of the Jensen inequality (there are many different proofs but all rely in one way or another on convexity).
To start one shows that the functions
f(u,v)=u^{b} v^{1-b} \quad \text{ and } g(u,v)=(u^{b} + v^{b})^\frac{1}{b}
\tag{7.1} are concave on \mathbb{R}^2_+ for b \in (0,1] (that is -f and -g are convex). To see this compute for example D^2f and D^2g.
Once this is done let us turn to Hölder inequality:
If p=1 and q=\infty then we have |XY|\le|X| \|Y\|_\infty almost surely and thus \|XY\|_1\le \|X\|_1\|Y\|_\infty.
The concavity of f in Equation 7.1 implies (apply Theorem 7.2 to the convex function -f) that for non-negative random variables U and V we have
E[ U^b V^{1-b}] \le E[U]^b E[V]^{1-b}\,.
If 1 < p,q < \infty then we set b=\frac{1}{p} (so that 1-b=\frac{1}{q}) and U=|X|^p and V=|Y|^q. Then
E[ |X| |Y| ] = E\left[ (|X|^p)^\frac{1}{p} (|Y|^q)^\frac{1}{q}\right] \le E[|X|^p]^\frac1p E[|Y|^q]^\frac{1}{q}
For Minkowski
If p=\infty Minkowski inequality is easy to check.
The concavity of g in Equation 7.1 implies that for non-negative random variables U and V we have
E\left[ \left(U^b + V^{b}\right)^{1/b}\right] \le \left(E[U]^b + E[V]^{b}\right)^{1/b}\,.
which implies Minkowski if we take b=\frac{1}{p}, U=|X|^p and V=|Y|^p.
\quad \square.
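Note that Hölder and Minkowski hold for every probability measure, so in particular they hold exactly for the empirical measure of any finite sample; a numerical check (with the arbitrary choices p=3, q=3/2 and arbitrary samples) must therefore succeed.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=500_000)
Y = rng.exponential(size=500_000)

def norm(Z, p):
    # empirical L^p norm: sample means stand in for expectations
    return (np.abs(Z) ** p).mean() ** (1.0 / p)

p, q = 3.0, 1.5                                          # 1/p + 1/q = 1
assert np.abs(X * Y).mean() <= norm(X, p) * norm(Y, q)   # Hölder
assert norm(X + Y, p) <= norm(X, p) + norm(Y, p)         # Minkowski
print(norm(X, p), norm(Y, q))
```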
Definition 7.2 (L^p spaces) For 1\le p \le \infty we define
\mathcal{L}^p(\Omega,\mathcal{A},P) =\left\{ X: \Omega \to \mathbb{R}, \|X\|_p < \infty \right\}
and the quotient space
L^p(\Omega,\mathcal{A},P) = \mathcal{L}^p(\Omega,\mathcal{A},P)/\sim
where X \sim Y (meaning X=Y a.s.) is an equivalence relation.
The space L^p(\Omega,\mathcal{A},P) is a normed vector space.
Theorem 7.4 The map p \mapsto \|X\|_p is an increasing map from [1,\infty) to [0,\infty).
If \|X\|_\infty < \infty then p \mapsto \|X\|_p is continuous on [1,\infty].
If \|X\|_\infty = \infty there exists q such that \|X\|_p is continuous on [1,q] (that is left-continuous at q) and \|X\|_p=+\infty on (q,\infty).
Proof. Homework
Examples:
If X has a Pareto distribution with parameters \alpha and x_0 then its CDF is
F(t)= 1- \left(\frac{t}{x_0}\right)^\alpha \quad \text{ for } t \ge x_0
and F(t)=0 for t \le x_0.
The pdf is f(x) = \frac{\alpha x_0^\alpha}{x^{\alpha+1}} and we have
E[|X|^p] = E[X^p] = \int_{x_0}^\infty \alpha x_0^\alpha x^{-\alpha -1 + p} dx =
\left\{
\begin{array}{cl}
\frac{\alpha}{\alpha-p} x_0^{p}& p <\alpha \\
+\infty & p \ge \alpha
\end{array}
\right.
If X has a normal distribution (or an exponential, gamma, etc.) then X \in L^p for all 1\le p <\infty but X \notin L^\infty.
Other norms exist (Orlicz norms) to capture the tails of random variables.
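The Pareto computation above can be checked by sampling via the inverse CDF, X = x_0 U^{-1/\alpha} with U uniform on (0,1) (the values \alpha=3, x_0=1 and p=1 below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, x0 = 3.0, 1.0
U = rng.uniform(size=2_000_000)
X = x0 * U ** (-1.0 / alpha)           # inverse-CDF sampling of Pareto(alpha, x0)

p = 1.0                                 # p < alpha: the moment is finite
emp = (X ** p).mean()                   # empirical E[X^p]
exact = alpha / (alpha - p) * x0 ** p   # the formula: alpha/(alpha-p) x0^p = 1.5
print(emp, exact)
```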
7.4 Markov, Chebyshev, and Chernov
Another very important inequality is the so-called Markov inequality. Very simple and very useful.
Theorem 7.5 (Markov inequality) If X \ge 0 then for any a>0
P(X \ge a) \le \frac{E[X]}{a}
Proof. Using that X is non-negative we have
X \ge X 1_{X\ge a} \ge a 1_{X\ge a}
and taking expectation and monotonicity gives the result.
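Since Markov's inequality holds for every probability measure, it holds exactly for the empirical measure of any sample; a quick numerical sketch (the exponential sample is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.exponential(size=1_000_000)    # nonnegative sample with E[X] = 1
EX = X.mean()

# P(X >= a) <= E[X]/a for every threshold a > 0
for a in (0.5, 1.0, 2.0, 5.0):
    assert (X >= a).mean() <= EX / a
print(EX)
```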
Theorem 7.6 (Chebyshev inequality) We have
P(|X-E[X]| \ge \varepsilon) \le \frac{{\rm Var}[X]}{\varepsilon^2}
where {\rm Var}(X)=E[(X - E[X])^2] is the variance of X.
Proof. Apply Markov inequality to the random variable (X- E[X])^2 whose expectation is {\rm Var}[X]:
P(|X-E[X]| \ge \varepsilon) = P((X-E[X])^2 \ge \varepsilon^2) \le \frac{E[(X-E[X])^2]}{\varepsilon^2} =\frac{{\rm Var}[X]}{\varepsilon^2}
Chebyshev inequality suggests measuring deviation from the mean in multiple of the standard deviation \sigma= \sqrt{{\rm Var}[X]}:
P(|X-E[X]| \ge k \sigma ) \le \frac{1}{k^2}
Chebyshev’s inequality might be extremely pessimistic, but it is sharp in general. Consider the RV X with distribution
P(X=\pm 1) = \frac{1}{2k^2} \quad P(X=0)=1 -\frac{1}{k^2}
Then E[X]=0 and {\rm Var}[X]=\frac{1}{k^2}, so \sigma=\frac{1}{k} and
P( |X|\ge k \sigma) = P(|X| \ge 1 )=\frac{1}{k^2}
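The sharpness can be verified by direct computation on the three-point distribution (the value k=4 below is an arbitrary choice):

```python
import numpy as np

k = 4
p = 1.0 / (2 * k ** 2)
values = np.array([-1.0, 0.0, 1.0])
probs = np.array([p, 1 - 2 * p, p])          # P(X = ±1) = 1/(2k^2), P(X = 0) = 1 - 1/k^2

mean = (values * probs).sum()                # E[X] = 0
var = ((values - mean) ** 2 * probs).sum()   # Var[X] = 1/k^2
sigma = np.sqrt(var)                         # sigma = 1/k, so k*sigma = 1
tail = probs[np.abs(values) >= k * sigma].sum()
print(mean, var, tail, 1.0 / k ** 2)         # tail equals the Chebyshev bound 1/k^2
```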
Theorem 7.7 (Chernov inequality) We have for any a
P(X \ge a) \le \inf_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}}
\quad \quad P(X \le a) \le \inf_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}
Proof. This is again an application of Markov inequality. If t \ge 0 since the function e^{tx} is increasing
P( X \ge a) = P( e^{tX} \ge e^{ta} ) \le \frac{E[e^{tX}]}{e^{ta}}
Since this holds for any t\ge 0 we can then optimize over t. The second inequality is proved in the same manner. \quad \square
Chernov inequality is a very sharp inequality as we will explore later on when studying the law of large numbers. The optimization over t is the key ingredient which ensures sharpness.
The function M(t)=E[e^{tX}] is called the moment generating function of the RV X and we will meet it again.
Example Suppose X is a centered normal random variable with mean \mu=0 and variance \sigma^2. Then, completing the square, we have
E[e^{tX}]= \frac{1}{\sqrt{2\pi\sigma^2}} \int e^{tx} e^{-x^2/2\sigma^2} dx = e^{\sigma^2t^2/2}\frac{1}{\sqrt{2\pi\sigma^2}} \int e^{-(x-\sigma^2t)^2/2\sigma^2}\,dx = e^{\sigma^2t^2/2}
and the Chernov bound gives for a \ge 0 that P(X \ge a) \le \inf_{t\ge 0} e^{\sigma^2t^2/2 -ta} = e^{- \sup_{t\ge 0} (ta - \sigma^2t^2/2)} = e^{-a^2/2\sigma^2} which turns out to be sharp up to a prefactor (see exercises).
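Comparing the bound e^{-a^2/2\sigma^2} with the exact Gaussian tail numerically (via the error function; \sigma=1 and the thresholds below are arbitrary choices) shows the bound is indeed off only by a modest prefactor:

```python
from math import erf, exp, sqrt

sigma = 1.0
for a in (1.0, 2.0, 3.0):
    # exact tail P(X >= a) = 1 - Phi(a/sigma), written with erf
    exact_tail = 0.5 * (1.0 - erf(a / (sigma * sqrt(2.0))))
    chernov = exp(-a ** 2 / (2.0 * sigma ** 2))   # the Chernov bound
    assert exact_tail <= chernov
    print(a, exact_tail, chernov)
```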
7.5 Exercises
Exercise 7.1
Prove the one-sided Chebyshev inequality: if X is a random variable and \epsilon >0 then
P( X - E[X] \ge \epsilon ) \le \frac{\sigma^2}{\sigma^2 + \epsilon^2}
where \sigma^2={\rm Var}(X). Hint: Set Y=X-E[X] and use Markov inequality for P(Y \ge \epsilon) = P( (Y+\alpha)^2 \ge (\epsilon+ \alpha)^2) and optimize over \alpha
Prove that
The one-sided Chebyshev inequality is sharper than the Chebyshev inequality for one sided bounds P(X - E[X] \ge \epsilon).
The Chebyshev inequality is sharper than the one-sided Chebyshev inequality for the two-sided bound P(|X - E[X]| \ge \epsilon).
Exercise 7.2 Prove Theorem 7.4. For the monotonicity use Hölder or Jensen. For the continuity let p_n \nearrow q and use the dominated convergence theorem.
Exercise 7.3 The Chernov bound has the form
P( X \ge a) \le e^{- \sup_{t \ge 0}\{ ta - \ln E[e^{tX}]\}}.
Show that this bound is useful only if a > E[X]. To do this use Jensen inequality to show that if a \le E[X] the Chernov bound is trivial.
Exercise 7.4 Consider an exponential RV X with parameter \lambda and density \lambda e^{-\lambda t} for t \ge 0.
Compute the moment generating function M(t)=E[e^{tX}] as well as all the moments E[X^n].
To do a Chernov bound compute \sup_{t \ge 0}\{ ta - \ln E[e^{tX}] \} (see also Exercise 7.3).
For a > \frac{1}{\lambda} estimate P( X >a) (which of course is equal to e^{-\lambda a}!) using
Markov inequality.
Chebyshev
One-sided Chebyshev
Chernov
Exercise 7.5 (mean and median of a RV) A median m for a RV X is a value m such that P(X\le m) \ge \frac{1}{2} and P(X \ge m) \ge 1/2 (see the quantile). For example if the CDF is one-to-one then the median m=F^{-1}(\frac{1}{2}) is unique.
The median m and the mean \mu= E[X] are two measures (usually distinct) of the “central value” of the RV X.
Consider the minimum square deviation \min_{a \in \mathbb{R}} E[ (X-a)^2]. Show that the minimum is attained when a=E[X].
Consider the minimum absolute deviation \min_{a \in \mathbb{R}} E[ |X-a|]. Show that the minimum is attained when a is a median m. Hint: Suppose a >m then we have
|z-a| - |z-m| = \left\{
\begin{array}{ll}
m - a & \textrm{ if } z \ge a \\
a +m -2z \quad (\ge m -a) & \textrm{ if } m < z < a \\
a-m & \textrm{ if } z \le m
\end{array}
\right.
Exercise 7.6 (mean and median of a RV, continued) Our goal is to prove a bound on how far the mean and the median can be apart from each other, namely that
|\mu - m| \le \sigma
where \sigma is the standard deviation of X. I am asking for two proofs:
First proof: Use the characterization of the median in Exercise 7.5 and Jensen inequality (twice) starting from |\mu-m|.
Second proof: Use the one-sided Chebyshev for X and -X with \epsilon = \sigma.
The quantity
S=\frac{\mu-m}{\sigma} \in [-1,1]
is called the non-parametric skew of X and measures the asymmetry of the distribution of X.
8 Independence and product measures
8.1 Independence
Two events A,B \in \mathcal{A} are independent if P(A|B)=P(A) (when P(B)>0), or equivalently P(A \cap B) =P(A) P(B), and this generalizes to arbitrary collections of events (see Definition 1.8).
For two random variables X and Y to be independent, any information or knowledge derived from the RV Y should not influence the RV X. All the information encoded in the RV X taking values in (E,\mathcal{E}) is contained in the \sigma-algebra generated by X, that is \sigma(X) = X^{-1}(\mathcal{E}) = \left\{ X^{-1}(B), B \in \mathcal{E}\right\}. This motivates the following definition.
Definition 8.1 (Independence) Let (\Omega, \mathcal{A},P) be a probability space.
Independence of \sigma-algebras: Two sub-\sigma-algebras \mathcal{A}_1 \subset \mathcal{A} and \mathcal{A}_2 \subset \mathcal{A} are independent if
P(A_1 \cap A_2) = P(A_1) P(A_2) \quad \text{ for all } A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2\,.
A collection (not necessarily countable) of \sigma-algebras \left\{\mathcal{A}_j\right\}_{j \in J} is independent if for any finite subset I \subset J we have
P\left( \bigcap_{i\in I} A_i\right) = \prod_{i \in I} P(A_i) \quad \text{ for all } A_i \in \mathcal{A}_i \,.
Independence of random variables: The collection of random variables X_j: (\Omega, \mathcal{A},P) \to (E_j, \mathcal{E}_j) for j \in J are independent if the collection of \sigma-algebras \sigma(X_j) are independent.
We consider from now on only two random variables X and Y but everything generalizes easily to arbitrary collections. Our next theorem makes the definition easier to check.
Theorem 8.1 (Characterization of independence) Two random variables X (taking values in (E,\mathcal{E})) and Y (taking values in (F,\mathcal{F})) are independent if and only if any of the following equivalent conditions holds.
P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{E} and for all B \in \mathcal{F}.
P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{C} and for all B \in \mathcal{D} where \mathcal{C} and \mathcal{D} are p-systems generating \mathcal{E} and \mathcal{F}.
f(X) and g(Y) are independent for any measurable f and g.
E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and measurable (or nonnegative) f,g.
If E=F=\mathbb{R} (or more generally E and F are complete separable metric spaces) E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and continuous functions f,g.
Proof.
Item 1. is merely a restatement of the definition and clearly item 1. \implies item 2.
To see that Item 2. \implies item 1. we use the monotone class theorem Theorem 3.1. Fix B \in \mathcal{D} then the collection
\left\{ A \in \mathcal{E}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\}
contains the p-system \mathcal{C} and is a d-system (contains \Omega, closed under complement, closed under increasing limits, check this please). Therefore by the monotone class theorem it contains \mathcal{E}. Analogously fix now A \in \mathcal{E}; then the set \left\{ B \in \mathcal{F}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\} contains the p-system \mathcal{D} and hence contains \mathcal{F}.
To see that item 3. \implies item 1. first we take f(x)=x and g(y)=y. Conversely we note that V=f(X) is measurable with respect to \sigma(X): since V^{-1}(B) = X^{-1} (f^{-1}(B)) \in \sigma(X) this shows that \sigma(f(X))\subset \sigma(X). Likewise \sigma(g(Y)) \subset \sigma(Y). Since \sigma(X) and \sigma(Y) are independent so are \sigma(f(X)) and \sigma(g(Y)).
To see that item 4. \implies item 1. take f=1_A and g=1_B. To show that item 1. \implies item 4. note that item 1. can be rewritten as E[1_A(X) 1_B(Y)] =E[1_A(X)] E[1_B(Y)]. By linearity of the expectation item 4. then holds for all simple functions f and g. If f and g are non-negative then we choose sequences of simple functions such that f_n \nearrow f and g_n \nearrow g. We have then f_n g_n \nearrow fg and using the monotone convergence theorem twice we have
\begin{aligned}
E[f(X)g(Y)] & = E[ \lim_{n} f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) ] E[g_n(Y)] \\
&= \lim_{n} E[ f_n(X) ] \lim_{n} E[g_n(Y)] =E[ f(X) ] E[g(Y)]
\end{aligned}
If f and g are bounded and measurable then we write f=f_+-f_- and g=g_+-g_-. Then f_\pm and g_\pm are bounded and measurable and thus the products of f_\pm(X) and g_\pm(Y) are also integrable. We have
\begin{aligned}
E[f(X)g(Y)] & = E[(f_+(X)-f_-(X))(g_+(Y)-g_-(Y))] \\
&= E[ f_+(X)g_+(Y)]+ E[f_-(X)g_-(Y)] - E[f_+(X)g_-(Y)]- E[f_-(X)g_+(Y)] \\
&= E[ f_+(X)]E[g_+(Y)]+ E[f_-(X)]E[g_-(Y)] - E[f_+(X)]E[g_-(Y)]- E[f_-(X)]E[g_+(Y)] \\
&= E[ f_+(X)-f_-(X)] E[ g_+(Y)-g_-(Y)]
\end{aligned}
Clearly item 1. \implies item 4. and item 4. \implies item 5. For the converse we show that item 5. \implies item 2. Given an interval (a,b) consider an increasing sequence of continuous piecewise linear functions f_n \nearrow 1_{(a,b)}:
f_n(t) = \left\{
\begin{array}{cl}
0 & t \le a \text{ or } t \ge b \\
1 & a+\frac{1}{n} \le t \le b-\frac{1}{n}\\
\text{linear} & \text{otherwise }
\end{array}
\right.
Let us consider the p-system which contains all intervals of the form (a,b) and which generates the Borel \sigma-algebra \mathcal{B}. By using a monotone convergence argument as in the proof that item 1. \implies item 4., we see that for f_n \nearrow 1_{(a,b)} and g_n \nearrow 1_{(c,d)}
E[f_n(X) g_n(Y)]= E[f_n(X)] E[g_n(Y)] \implies P(X \in (a,b), Y \in (c,d)) = P(X \in (a,b)) P(Y \in (c,d))
and so item 2. holds.
For complete separable metric spaces, the so-called Urysohn lemma can be used to prove the same result. \quad \square
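Item 4. of Theorem 8.1 is easy to probe numerically: for independently sampled X and Y and bounded functions f and g (here the arbitrary choices f=\cos and g=\tanh), the sample average of f(X)g(Y) nearly coincides with the product of the sample averages.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2_000_000
X = rng.standard_normal(n)      # independent by construction
Y = rng.exponential(size=n)

f, g = np.cos, np.tanh          # bounded (continuous) test functions
lhs = (f(X) * g(Y)).mean()      # empirical E[f(X) g(Y)]
rhs = f(X).mean() * g(Y).mean() # empirical E[f(X)] E[g(Y)]
print(lhs, rhs)                 # nearly equal
```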
8.2 Independence and product measures
So far we have expressed the independence of X and Y as a property on the sample space \Omega; we can also view it as a property of the distributions P^X and P^Y.
Example: Suppose X and Y are discrete random variables then if they are independent we have
P(X =i, Y=j) = P(X=i) P(Y=j)
and thus P^{(X,Y)}(i,j) = P^X(i) P^Y(j). That is the distribution of the random variable Z=(X,Y) factorizes into the product of P^X and P^Y.
Product spaces. In order to build up more examples we need the so-called Fubini theorem. Given two measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}) we consider the product space (E \times F, \mathcal{E} \otimes \mathcal{F}) where \mathcal{E} \otimes \mathcal{F} is the \sigma-algebra generated by the rectangles A \times B.
Measurable functions on product spaces. As we have seen in Exercise 4.6, for any measurable function f: E \times F \to \mathbb{R} the sections g(y)=f(x,y) (for fixed x) and h(x)=f(x,y) (for fixed y) are measurable.
Theorem 8.2 (Tonelli-Fubini Theorem) Suppose P is a probability on (E,\mathcal{E}) and Q is a probability on (F,\mathcal{F}).
The set function R defined on rectangles A \times B, with A \in \mathcal{E} and B \in \mathcal{F}, by
R(A \times B) = P(A) Q(B)
extends to a unique probability measure on \mathcal{E}\otimes \mathcal{F}. This measure is denoted by P \otimes Q and is called the product measure of P and Q.
Suppose f is measurable with respect to \mathcal{E}\otimes \mathcal{F}, either non-negative, or integrable with respect to P \otimes Q. Then the functions
x \mapsto \int f(x,y) dQ(y) \quad \text{ and } \quad y \mapsto \int f(x,y) dP(x)
are integrable with respect to P and Q respectively and we have
\int_{E \times F} f(x,y) d (P\otimes Q) (x,y) = \int_E \left(\int_F f(x,y) dQ(y)\right) dP(x) = \int_F \left(\int_E f(x,y) dP(x)\right) dQ(y)
Proof. For item 1. we need to extend R to arbitrary elements of \mathcal{E}\otimes\mathcal{F}. For C \in\mathcal{E}\otimes\mathcal{F} consider the slice of C along x given by
C(x) = \left\{ y \in F\,: (x,y) \in C \right\} \,.
If C = A \times B, then C(x)=B for all x\in A and C(x)=\emptyset for x \notin A and we have then
R(C) = P(A)Q(B) = \int_E Q(C(x)) dP(x)
Now we define
\mathcal{H}=\left\{ C \in \mathcal{E}\otimes\mathcal{F}\,:\, x \to Q(C(x)) \text{ is measurable} \right\}
it is not difficult to check that \mathcal{H} is a \sigma-algebra which contains all the rectangles A \times B, and therefore \mathcal{H} = \mathcal{E} \otimes \mathcal{F}.
We now define, for any C \in \mathcal{E} \otimes \mathcal{F},
R(C) = \int_E Q(C(x)) dP(x)
and we check this is a probability measure. Clearly R(E \times F) = P(E)Q(F)=1. Let C_n \in \mathcal{E}\otimes\mathcal{F} be pairwise disjoint and C = \cup_{n=1}^\infty C_n. Then the slices C_n(x) are pairwise disjoint and by countable additivity of Q we have Q(C(x))=\sum_{n=1}^\infty Q(C_n(x)).
Applying the MCT to the functions g_n(x) = \sum_{k=1}^n Q(C_k(x)) we find that
\sum_{n=1}^\infty R(C_n) = \sum_{n=1}^\infty \int_E Q(C_n(x)) dP(x) = \int_E \sum_{n=1}^\infty Q(C_n(x)) dP(x) = \int_E Q(C(x)) dP(x) =R(C).
and this shows that R is a probability measure. Uniqueness of R follows from the monotone class theorem.
For item 2. note that in item 1. we have proved it in the case where f(x,y)= 1_C(x,y). By linearity the result then holds for simple functions. If f is nonnegative and measurable then pick an increasing sequence of simple functions such that f_n \nearrow f. Then by the MCT
\int f(x,y) d P\otimes Q(x,y) = \lim_{n} \int f_n(x,y) d P\otimes Q(x,y) = \lim_{n} \int_E (\int_F f_n(x,y) d Q(y)) dP(x)
But the functions x \to \int_F f_n(x,y) d Q(y) are increasing in n so, applying the MCT twice more,
\begin{aligned}
\int f(x,y) d P\otimes Q(x,y) & = \int_E \lim_{n} (\int_F f_n(x,y) d Q(y)) dP(x) = \int_E (\int_F \lim_{n} f_n(x,y) d Q(y)) dP(x) \\
&= \int_E (\int_F f(x,y) d Q(y)) dP(x)
\end{aligned}
Similarly one shows that \int f(x,y) d P\otimes Q(x,y) = \int_F (\int_E f(x,y) d P(x)) dQ(y). The result for integrable f follows by decomposing f into positive and negative parts. \quad \square
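On finite spaces the content of the Fubini-Tonelli theorem reduces to elementary sums, which makes it easy to check numerically. A minimal Python sketch (the distributions P, Q and the function f are arbitrary illustrative choices; the notes themselves contain no code):

```python
# A toy check of Theorem 8.2 on finite spaces: build the product measure
# R = P (x) Q from two discrete probabilities and verify that the double
# sum agrees with both iterated sums.
P = {0: 0.3, 1: 0.7}             # probability on E = {0, 1}
Q = {0: 0.2, 1: 0.5, 2: 0.3}     # probability on F = {0, 1, 2}

def f(x, y):
    # an arbitrary bounded "measurable" function on E x F
    return (x + 1) * y ** 2

double = sum(f(x, y) * P[x] * Q[y] for x in P for y in Q)
iter_xy = sum(P[x] * sum(f(x, y) * Q[y] for y in Q) for x in P)
iter_yx = sum(Q[y] * sum(f(x, y) * P[x] for x in P) for y in Q)

assert abs(double - iter_xy) < 1e-12
assert abs(double - iter_yx) < 1e-12
```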
Applying this to random variables we get
Corollary 8.1 Suppose Z=(X,Y): (\Omega, \mathcal{A}, P) \to (E \times F, \mathcal{E}\otimes \mathcal{F}). Then the random variables X and Y are independent if and only if the distribution P^{(X,Y)} of the pair (X,Y) is equal to P^X \otimes P^Y.
Proof. The random variables X and Y are independent if and only if
P((X,Y) \in A \times B)=P(X \in A) P(Y \in B)\,.
This is equivalent to saying that
P^{(X,Y)}(A \times B) = P^X(A) P^Y(B)
By the uniqueness in Fubini Theorem we have P^{(X,Y)}=P^X \otimes P^Y. \quad \square
8.3 Constructing a probability space for independent random variables
We can construct a probability model for X_1, \cdots, X_n which are independent (real-valued) random variables with given distribution P^{X_1}, \cdots ,P^{X_n}.
By Theorem 5.6 we know how to construct the probability space for each random variable separately, for example,
X_i : (\Omega_i,\mathcal{B_i},P_i) \to (\mathbb{R}, \mathcal{B})
where \Omega_i=[0,1], P_i is Lebesgue measure on [0,1] and X_i = Q_i where Q_i is a quantile function for X_i.
We now take
\Omega = \prod_{i} \Omega_i = [0,1]^n \,, \quad \mathcal{B}= \otimes_{i=1}^n \mathcal{B_i}, \quad P= \otimes_{i=1}^n P_i
and define the map X: \Omega \to \mathbb{R}^n by X=(X_1,\cdots X_n).
The Fubini-Tonelli theorem shows that P^{X}=P \circ X^{-1} is the distribution of X=(X_1,\cdots, X_n) on \mathbb{R}^n with the product \sigma-algebra and that the random variables X_1, \ldots, X_n are independent.
Fubini-Tonelli for countably many RV:
We extend this result to countably many independent random variables: this is important in practice where we need such models, for example flipping a coin as many times as needed!
This can be seen as an extension of the Fubini theorem and is also a special case of the so-called Kolmogorov extension theorem which is used to construct general probability measures on infinite product spaces. No proof is given here.
Infinite product \sigma-algebras
Given \sigma-algebras \mathcal{A}_j on \Omega_j we set \Omega = \prod_{j=1}^\infty \Omega_j and define rectangles, for n_1 < n_2 < \cdots < n_k and k finite but arbitrary,
A_{n_1} \times A_{n_2} \times \cdots \times A_{n_k} = \{ \omega=(\omega_1,\omega_2,\omega_3, \cdots) \in \Omega \;: \omega_{n_j} \in A_{n_j}, \; j=1,\ldots,k\}
where A_{n_j} \in \mathcal{A}_{n_j}. The product \sigma-algebra \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j is the \sigma-algebra generated by all the rectangles.
Theorem 8.3 Given probability spaces (\Omega_j,\mathcal{A}_j,P_j) and with \Omega = \prod_{j=1}^\infty \Omega_j and \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j there exists a unique probability P on (\Omega,\mathcal{A}) such that
P(A_{n_1} \times \cdots \times A_{n_k}) = \prod_{j=1}^k P_{n_j}(A_{n_j})
for all A_{n_j}\in \mathcal{A}_{n_j}, all n_1< \cdots < n_k and all k.
If we have a RV X_n: \Omega_n \to \mathbb{R} then we define its extension to \Omega by
\tilde{X_n}(\omega) = X_n(\omega_n)
and the distribution of \tilde{X_n} is the same as the distribution of X_n because
\tilde{X_n}^{-1}(B_n) = \Omega_1 \times \cdots \times \Omega_{n-1} \times X_n^{-1}(B_n) \times \Omega_{n+1} \times \cdots
and thus
P(\tilde{X_n}\in B_n) =P_n (X_n \in B_n)\,.
A similar computation shows that \tilde{X}_n and \tilde{X}_m are independent for n \not= m, and more generally that any finite collection of the \tilde{X}_j's is independent.
8.4 Kolmogorov zero-one law
We consider (X_n)_{n=1}^\infty to be RVs defined on some probability space \Omega. We may think of n as “time” and we consider the following \sigma-algebras
\begin{aligned}
\mathcal{B}_n & = \sigma(X_n) & \text{the $\sigma$-algebra generated by } X_n \\
\mathcal{C}_n & = \sigma\left( \cup_{p \ge n} \mathcal{B}_p\right) & \text{the $\sigma$-algebra describing the "future" after time } n \\
\mathcal{C}_\infty & = \cap_{n=1}^\infty \mathcal{C}_n & \text{the "tail" $\sigma$-algebra or $\sigma$-algebra "at infinity"}
\end{aligned}
Theorem 8.4 (Zero-one law) Suppose X_n is a sequence of independent random variables and let \mathcal{C_\infty} be the corresponding tail \sigma-algebra. Then we have
C \in \mathcal{C}_\infty\implies P(C)=0 \text{ or } P(C)=1
Proof. The \sigma-algebras \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_{n+1}\} are independent and therefore \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_\infty\} are independent for every n since \mathcal{C}_\infty \subset \mathcal{C}_{n+1}. Therefore \mathcal{C}_1=\sigma\left( \cup_{n \ge 1} \mathcal{B}_n\right) and \mathcal{C}_\infty are independent. So for A \in \mathcal{C}_1 and B \in \mathcal{C}_\infty we have P(A \cap B)=P(A) P(B). This holds in particular for A=B \in \mathcal{C}_\infty since \mathcal{C}_\infty \subset \mathcal{C}_1. Therefore we have P(A)=P(A)^2, which is possible only if P(A)=0 or 1. \quad \square
Examples Given independent random variable X_1, X_2, \cdots we define S_n = X_1 + X_2 +\cdots + X_n.
The event \{ \omega\,:\, \lim_{n} X_n(\omega) \text{ exists }\} belongs to every \mathcal{C}_n and thus belongs to \mathcal{C}_\infty. Therefore X_n either converges a.s. or diverges a.s.
A random variable which is measurable with respect to \mathcal{C}_\infty must be constant almost surely. Therefore
\limsup_n X_n \,, \quad \liminf_n X_n\,, \quad \limsup_n \frac{1}{n} S_n\,, \quad \liminf_n \frac{1}{n} S_n
are all constant almost surely.
The event \limsup_{n} \{ X_n \in B\} (also called \{ X_n \in B \text{ infinitely often }\}) is in \mathcal{C}_\infty.
The event \limsup_{n} \{ S_n \in B\} is not in \mathcal{C}_\infty.
8.5 Exercises
Exercise 8.1
Suppose X\ge 0 is a non-negative random variable and p > 0. Show that
E[X^p]= \int_0^\infty p t^{p-1} (1- F(t)) dt = \int_0^\infty p t^{p-1} P(X>t) dt
In particular E[X]=\int P(X>t) dt. Hint: Use Fubini on the product measure P times Lebesgue measure on [0,\infty).
Deduce from this that if X takes non-negative integer values we have
E[X]= \sum_{n \ge 0} P( X>n)\,, \quad E[X^2]= 2 \sum_{n >0} n P( X>n) + E[X] \,.
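These identities are easy to verify numerically. A Python sketch (the pmf below is an arbitrary illustrative choice; both tail sums run over n \ge 0, where the n=0 term of the second sum vanishes):

```python
# Numerical check of the tail-sum identities for an integer-valued X
# with an arbitrary distribution on {0, ..., 10}.
pmf = [0.05, 0.1, 0.2, 0.25, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]

def tail(n):
    # P(X > n)
    return sum(pmf[k] for k in range(n + 1, len(pmf)))

EX = sum(k * p for k, p in enumerate(pmf))
EX2 = sum(k ** 2 * p for k, p in enumerate(pmf))

EX_tail = sum(tail(n) for n in range(len(pmf)))              # sum over n >= 0
EX2_tail = 2 * sum(n * tail(n) for n in range(len(pmf))) + EX

assert abs(EX - EX_tail) < 1e-12
assert abs(EX2 - EX2_tail) < 1e-12
```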
Exercise 8.2 Find three random variables X, Y, and Z taking values in \{-1,1\} which are pairwise independent but are not independent.
Exercise 8.3
A random variable X is a Bernoulli RV if it takes only the values 0 or 1; then E[X]=P(X=1) (alternatively you can think of X(\omega)=1_A(\omega) for some measurable set A). Show that Y=1-X is also Bernoulli and that the product of two Bernoulli random variables is again a Bernoulli RV (no independence required).
Suppose X_1, X_2, \cdots, X_n are Bernoulli random variables on some probability space (\Omega,\mathcal{A},P) (they are not assumed to be independent) and let Y_k = 1-X_k. Show that
P(X_1=0, X_2=0, \cdots, X_n=0) =E\left[ \prod_{i=1}^n Y_i\right]
and
P(X_1=0, \cdots, X_k =0,X_{k+1}=1, \cdots, X_n=1) =E \left[ \prod_{i=1}^k Y_i \prod_{j=k+1}^n X_j\right]
Show that the Bernoulli random variables X_1, \cdots, X_n are independent if and only if E[\prod_{j\in J} X_j]= \prod_{j\in J}E[X_j] for all subsets J of \{1,\cdots, n\}.
Exercise 8.4 Consider the probability space ([0,1), \mathcal{B}, P) where \mathcal B is the Borel \sigma-algebra and P is the uniform distribution on [0,1). Expand each number \omega in [0,1) in dyadic expansion (or bits)
\omega = \sum_{n=1}^\infty \frac{d_n(\omega)}{2^n} =0. d_1(\omega)d_2(\omega)d_3(\omega) \cdots \quad \quad \textrm{ with } \quad d_n(\omega) \in \{0,1\}
For certain numbers the dyadic expansion is not unique. For example we have
\frac{1}{2} = 0.1000\cdots = \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots = 0.0111\cdots
Show that P almost surely a number \omega has a unique dyadic expansion.
Prove that each d_n(\omega) is a random variable and that they are identically distributed (i.e. each d_n has the same distribution).
Prove that the d_n, n=1,2,3, \cdots are a collection of independent and identically distributed random variables.
Remark: this problem shows that you can think of the Lebesgue measure on [0,1) as the infinite product measure of independent Bernoulli trials.
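An empirical sketch of this remark in Python (the choice of the first three digits and the sample size are arbitrary): extract binary digits from uniform samples and check that they look like independent fair coin flips.

```python
import random

# Empirical sketch: the binary digits d_1, d_2, d_3 of a uniform random
# number on [0, 1) behave like independent fair coin flips.
random.seed(0)
N = 100_000

def digits(omega, k):
    # first k binary digits of omega in [0, 1)
    out = []
    for _ in range(k):
        omega *= 2
        d = int(omega)
        out.append(d)
        omega -= d
    return out

samples = [digits(random.random(), 3) for _ in range(N)]

# each digit is approximately Bernoulli(1/2)
for j in range(3):
    freq = sum(s[j] for s in samples) / N
    assert abs(freq - 0.5) < 0.01

# the pair (d_1, d_2) is approximately uniform on {0,1}^2 (independence)
p11 = sum(1 for s in samples if s[0] == 1 and s[1] == 1) / N
assert abs(p11 - 0.25) < 0.01
```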
Exercise 8.5 Suppose X and Y are RVs with finite variance (or equivalently X,Y\in L^2). The covariance of X and Y is defined by
{\rm Cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY]- E[X]E[Y]
Show that {\rm Cov}(X,Y) is well defined and bounded in absolute value by \sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}.
The correlation coefficient \rho(X,Y)=\frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}} measures the correlation between X and Y. Given a number \alpha \in [-1,1] find two random variables X and Y such that \rho(X,Y)=\alpha.
Show that {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y) + 2{\rm Cov}(X,Y).
Show that if X and Y are independent then {\rm Cov}(X,Y)=0 and so {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y).
The converse statement of 3. is in general not true, that is {\rm Cov}(X,Y)=0 does not imply that X and Y are independent. Hint: take X standard normal, Z discrete with P(Z=\pm 1)=\frac12 (Z is sometimes called a Rademacher RV) and independent of X, and Y=ZX.
Suppose X_1, \cdots, X_n are independent RV with E[X_j]=\mu and {\rm Var}(X_j)=\sigma^2< \infty for all j. The sample mean \overline{X} and sample variance S^2 are given by
\overline{X}=\frac{1}{n}\sum_{j=1}^n X_j \quad \quad S^2= \frac{1}{n-1}\sum_{j=1}^n(X_j -\overline{X})^2
Show that E[\overline{X}]=\mu, {\rm Var}(\overline{X})=\frac{\sigma^2}{n} and E[S^2]=\sigma^2.
9 Probability kernels and measures on product spaces
9.1 Probability kernels
How do we build “general” measures on some product space E \times F (e.g. \mathbb{R}^2 or \mathbb{R}^n)?
Definition 9.1 Given 2 measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}) a probability kernel K(x,B) from (E,\mathcal{E}) into (F,\mathcal{F}) (also often called a Markov kernel) is a map
K: E \times \mathcal{F} \to \mathbb{R}
such that
For any B\in \mathcal{F} the map x \to K(x,B) is a measurable map.
For any x\in E the map B \to K(x,B) is a probability measure on (F,\mathcal{F}).
Examples
If E and F are countable sets then a kernel is built from a transition matrix K(i,j), i \in E and j \in F, such that
K(i,j) \ge 0 \text{ and } \sum_{j \in F} K(i,j) =1
We have then K(i,B)= \sum_{j \in B} K(i,j).
Suppose (F,\mathcal{F})=(\mathbb{R},\mathcal{B}); then a kernel can be built from a “conditional density” k(x,y), where x \in E and y\in \mathbb{R}, such that k is measurable and \int_{\mathbb{R}} k(x,y) dy=1 for all x (see examples below).
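The first, countable-state example can be sketched in code (Python; the two-state "weather" transition matrix is an invented illustration, not from the notes):

```python
# A minimal discrete probability kernel in the sense of Definition 9.1:
# the kernel is stored as a transition matrix K[i][j] with rows summing
# to one, and K(i, B) is obtained by summing a row over B.
K = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def kernel(i, B):
    # K(i, B) = sum_{j in B} K(i, j), a probability measure in B for fixed i
    return sum(K[i][j] for j in B)

# each row is a probability measure on F
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in K.values())
assert abs(kernel("sunny", {"sunny", "rainy"}) - 1.0) < 1e-12
assert kernel("rainy", {"sunny"}) == 0.4
```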
The following theorem is a generalization of the Fubini-Tonelli theorem.
Theorem 9.1 Let P be a probability measure on (E,\mathcal{E}) and K(x,B) a probability kernel from (E,\mathcal{E}) into (F, \mathcal{F}). Then a probability measure R on (E \times F, \mathcal{E}\otimes \mathcal{F}) is defined by
\int f(x,y) dR(x,y) = \int_E \left(\int_F f(x,y) K(x,dy)\right) dP(x)
for any measurable non-negative f. In particular we have for any A \in \mathcal{E} and B \in \mathcal{F}
R( A \times B) = \int 1_A(x) 1_B(y) dR(x,y) = \int 1_A(x) K(x,B) dP(x) = \int_A K(x,B) dP(x)\,.
We write
R(dx,dy)=P(dx)K(x,dy)
Proof. The proof is very similar to Fubini-Tonelli theorem and is thus omitted.
This theorem is intimately related to the concepts of conditional probability and conditional expectations which we will study later.
Roughly speaking, on nice probability spaces (e.g. separable metric spaces), every probability measure on the product space E \times F can be constructed in the way described in Theorem 9.1 (this is a deep result).
Definition 9.2 (marginals of a probability measure) Given a probability measure R on the product space (E \times F, \mathcal{E}\otimes \mathcal{F}) the marginals of R on E and F are defined to be the measures given by
\begin{aligned}
P(A) &= R(A \times F)\, \quad A \in \mathcal{E} \\
Q(B) &= R(E \times B)\, \quad B \in \mathcal{F}
\end{aligned}
Alternatively we can think of the marginals as the image measures
P = R \circ \pi_E^{-1}, \quad \quad Q = R \circ \pi_F^{-1}
where \pi_E and \pi_F are the projection maps \pi_E(x,y) = x and \pi_F(x,y) = y
If R=P \otimes Q is a product measure then the marginals of R are P and Q. This corresponds to the constant kernel K(x,B)=Q(B).
If R(dx,dy)= P(dx) K(x,dy) then we have
R(A \times F) = \int_A K(x,F) dP(x) = P(A) \quad \quad
R(E \times B) = \int_E K(x,B) dP(x)
so its marginals are P and Q, where Q(B) = \int_E K(x,B) dP(x).
9.2 Lebesgue measure on \mathbb{R}^n and densities
In a previous chapter we constructed the Lebesgue probability measure P_0 on [0,1] as the unique measure such that P_0[(a,b]]=b-a. Using this and other uniformly distributed random variables we define the Lebesgue measure on \mathbb{R} and on \mathbb{R}^n.
Definition 9.3 The Lebesgue measure on \mathbb{R} is the set function m on the Borel \sigma-algebra such that m[(a,b]]=b-a for all a<b.
The Lebesgue measure on \mathbb{R} is not a probability measure since m(\mathbb{R})=+\infty; it is an example of an infinite measure. We can easily construct it using uniform random variables on (n,n+1] with distributions P_{n,n+1}, namely we set
m(A) = \sum_{n=-\infty}^\infty P_{n,n+1}(A)
The uniqueness of the measure P_{n,n+1} implies that the measure m is unique as well.
By a Fubini-Tonelli type argument one can construct the Lebesgue measure on \mathbb{R}^n.
Definition 9.4 If we equip \mathbb{R}^n with the product \sigma-algebra \mathcal{B}\otimes \cdots \otimes \mathcal{B} the Lebesgue measure m_n on \mathbb{R^n} is the product of n Lebesgue measure on \mathbb{R}. We have
m_n\left( \prod_{i=1}^n [a_i,b_i]\right)= \prod_{i=1}^n (b_i-a_i)
Notation: we often use dx or dx_1 \cdots dx_n for integration with respect to m_n.
Definition 9.5 A probability measure on (\mathbb{R}^n,\mathcal{B}_n) (where \mathcal{B}_n=\mathcal{B} \otimes \cdots \otimes \mathcal{B}) has a density f if f is a nonnegative Borel measurable function and
P(A)\,=\, \int_A f(x) dx = \int 1_A(x) f(x) dx = \int f(x_1, \cdots, x_n) 1_A(x_1, \cdots, x_n) dx_1 \cdots dx_n
Theorem 9.2 A non-negative Borel measurable function f(x) is the density of a Borel probability measure P if and only if \int f(x) dx =1, and the density determines the probability P. Conversely the probability measure determines its density (if it exists!) up to a set of Lebesgue measure 0.
Proof. Given f\ge 0 it is easy to check that
P(A) = \int 1_A f(x) dx
defines a probability measure (same proof as in Exercise 7.3).
Conversely assume that f and f' are two densities for the measure P, then for any measurable set A we have P(A)=\int_A f(x) dx = \int_A f'(x) dx. Consider now the set
A_n =\left\{ x : f'(x)\ge f(x)+ \frac1n\right\}
Then we have
P(A_n)= \int_{A_n} f'(x) dx \ge \int_{A_n} \left(f(x) + \frac1n\right) dx = P(A_n) + \frac1n m(A_n)
and therefore m(A_n)=0. Since the A_n increase to \{f < f'\} we have shown that m(\{f < f'\})=0. By symmetry m(\{f' < f\})=0 and thus f=f' almost everywhere. \quad \square
Theorem 9.3 Suppose the random variable (X,Y) has a probability distribution R(dx,dy) with density f(x,y). Then
Both X and Y have densities given respectively by
f_X(x)=\int f(x,y) dy\,,\quad \quad f_Y(y)=\int f(x,y) dx \,.
X and Y are independent if and only if
f(x,y) = f_X(x)f_Y(y)\,.
If we set
k(x,y)= \frac{f(x,y)}{f_X(x)} \text{ if } f_X(x) \not= 0
and this defines a kernel K(x,B)=\int_B k(x,y)dy and the probability distribution of (X,Y) is given by R(dx,dy)=f_X(x)k(x,y)dxdy.
Remark It does not matter how K(x,B) is defined for those x where f_X(x)=0 and so we have left it undefined. In general there are many kernel densities which give the same probability, allowing for changes on sets of probability zero.
Proof.
For A \in \mathcal{B} we have
P(X\in A) = P(X \in A, Y \in \mathbb{R}) = \int_A \left(\int_{\mathbb{R}} f(x,y) dy\right) dx = \int_A f_X(x) dx
and since this holds for all A, f_X(x) is a density for the distribution of X.
If f(x,y)=f_X(x) f_Y(y) then
\begin{aligned}
P(X \in A, Y \in B)&= \int 1_{A \times B}(x,y) f_X(x) f_Y(y) dx dy = \int 1_{A}(x) f_X(x) dx \int 1_{B}(y) f_Y(y) dy \\
&=P(X\in A)P(Y\in B)
\end{aligned}
Conversely assume that X and Y are independent. Consider the collection of sets
\mathcal{H} =\left\{ C \in \mathcal{B}\otimes \mathcal{B} \,:\, \int_C f(x,y) dx dy = \int_C f_X(x) f_Y(y) dx dy\right\}
The independence and the Fubini theorem imply that any set C = A \times B belongs to \mathcal{H}. Since the rectangles form a p-system generating the \sigma-algebra, the monotone class theorem shows that \mathcal{H}=\mathcal{B}\otimes \mathcal{B}.
We have
\int k(x,y) dy = \int \frac{f(x,y)}{f_X(x)} dy = \frac{f_X(x)}{f_X(x)} =1
and thus k(x,y) is a density. Furthermore we have
\begin{aligned}
P( X \in A, Y \in B) &= \int_A K(x,B) dP(x) = \int_A f_X(x) \left(\int_B k(x,y) dy \right) dx = \int_A f_X(x) \left(\int_B \frac{f(x,y)}{f_X(x)} dy \right) dx \\
& = \int_A \int_B f(x,y) dx dy
\end{aligned}
and this concludes the proof since the measure is uniquely determined by its values on rectangles (by a monotone class theorem argument). \quad \square
9.3 Example: the Box-Muller algorithm
We derive here a method to generate two independent normal random variables using two independent random numbers. This is a different algorithm from the one using the quantile function of the normal RV, which is known only numerically.
Theorem 9.4 Suppose that U_1 and U_2 are two independent random numbers then
\begin{aligned}
X_1 = \sqrt{-2\ln(U_1)} \cos( 2 \pi U_2) \quad \quad \quad
X_2 = \sqrt{-2\ln(U_1)} \sin( 2 \pi U_2)
\end{aligned}
are two independent normal random variables with mean 0 and variance 1.
Proof. We use the expectation rule Theorem 6.8 together with polar coordinates. For any nonnegative function h(x_1,x_2) we have, using polar coordinates x_1 =r \cos\theta and x_2=r \sin \theta and then the change of variable s=r^2/2,
\begin{aligned}
E[ h(X_1,X_2)] & = \int h(x_1,x_2) f(x_1,x_2) dx_1 dx_2 = \int_{\mathbb{R}^2} h(x_1,x_2) \frac{1}{2\pi} e^{-\frac{\|x\|^2}{2}} dx_1 dx_2 \\
& = \int_{[0,\infty]\times[0,2\pi]} h(r \cos\theta,r \sin\theta) \frac{1}{2\pi}d\theta r e^{-\frac{r^2}{2}}dr \\
& = \int_{[0,\infty]\times[0,2\pi]} h(\sqrt{2s} \cos\theta, \sqrt{2s} \sin\theta) \frac{1}{2\pi}d\theta e^{-s} ds
\end{aligned}
This computation shows that if S is exponential with parameter 1 and \Theta is uniform on [0,2\pi] and independent of S, then \sqrt{2 S}\cos(\Theta) and \sqrt{2 S}\sin(\Theta) are independent standard normals. But we can also write S=-\ln(U_1) and \Theta=2\pi U_2. \quad \square
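The theorem translates directly into a sampling routine. A Python sketch (illustrative; we draw u_1 from (0,1] to avoid \ln(0), and check only the mean, the variance, and the empirical correlation of the two outputs):

```python
import math
import random

# A direct sampling implementation of the Box-Muller map: two independent
# uniforms produce two independent standard normals.
def box_muller(u1, u2):
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

random.seed(1)
N = 200_000
pairs = [box_muller(1.0 - random.random(), random.random()) for _ in range(N)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

mean_x = sum(xs) / N
var_x = sum(x * x for x in xs) / N - mean_x ** 2
assert abs(mean_x) < 0.02 and abs(var_x - 1.0) < 0.02

# independence implies zero correlation (a necessary check only)
mean_y = sum(ys) / N
cov = sum(x * y for x, y in zip(xs, ys)) / N - mean_x * mean_y
assert abs(cov) < 0.02
```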
9.4 Exponential mixture of exponential is polynomial
Let us consider an exponential random variable Y whose parameter is itself an exponential random variable X with parameter \lambda>0. That is, X has density
f(x)= \left\{
\begin{array}{cl}
\lambda e^{-\lambda x} & x \ge 0 \\
0 & \text{else}
\end{array}
\right.
and consider the kernel K(x,dy)=k(x,y)dy with
k(x,y) = \left\{
\begin{array}{cl}
xe^{-xy} & x>0, y>0 \\
0 & \text{ else }
\end{array}
\right.
Then (X,Y) has density
f(x,y)= \left\{
\begin{array}{cl}
\lambda e^{-\lambda x} x e^{-xy} = \lambda x e^{-(\lambda +y)x} & x>0, y>0 \\
0 & \text{ else }
\end{array}
\right.
Then, using that the mean of an exponential RV is the reciprocal of its parameter, the density of Y is
f(y) = \int_0^\infty f(x,y) dx = \frac{\lambda}{\lambda +y} \int_0^\infty x (\lambda +y) e^{-(\lambda +y)x} dx = \frac{\lambda}{(\lambda+y)^2}
which decays polynomially!
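A Monte Carlo sketch of this computation in Python (illustrative only): sample X \sim \text{Exp}(\lambda), then Y exponential with rate X, and compare the empirical tail with P(Y>y)=\lambda/(\lambda+y), which follows by integrating the density above.

```python
import random

# Draw X ~ Exp(lam) and then Y ~ Exp(X): the marginal density of Y is
# lam / (lam + y)^2, equivalently P(Y > y) = lam / (lam + y), a polynomial
# tail even though both ingredients are exponential.
random.seed(2)
lam = 1.0
N = 200_000

ys = []
for _ in range(N):
    x = random.expovariate(lam)        # the random rate X
    ys.append(random.expovariate(x))   # Y is exponential with rate X

for y0 in (1.0, 5.0, 20.0):
    emp = sum(1 for y in ys if y > y0) / N
    assert abs(emp - lam / (lam + y0)) < 0.01
```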
9.5 Gamma and beta random variables
Recall that a gamma RV with parameters (\alpha,\beta) has density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x} for x \ge 0.
Consider now two independent gamma RVs X_1 and X_2 with parameters (\alpha_1, \beta) and (\alpha_2, \beta). We prove the following facts:
Z=X_1 + X_2 is a gamma RV with parameters (\alpha_1+\alpha_2, \beta)
U=\frac{X_1}{X_1+X_2} is a beta RV with parameters \alpha_1 and \alpha_2, which has the density
f(u) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} \quad 0 \le u \le 1
X_1 + X_2 and \frac{X_1}{X_1+X_2} are independent RV
We use the expectation rule Theorem 6.8 and the change of variables z= x_1+x_2 and u= \frac{x_1}{x_1+x_2}, that is x_1= uz and x_2=(1-u)z. This maps [0,\infty) \times [0,\infty) to [0,1] \times [0,\infty) and the Jacobian of this transformation is equal to z.
We have then for any nonnegative h
\begin{aligned}
E[h(Z,U)]&=E\left[h\left(X_1+X_2,\frac{X_1}{X_1+X_2}\right)\right]\\
&= \int_0^\infty \int_0^\infty h\left(x_1+x_2,\frac{x_1}{x_1+x_2}\right) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} x_1^{\alpha_1-1} x_2^{\alpha_2 -1} e^{- \beta(x_1 + x_2)} dx_1 dx_2 \\
& = \int_0^1 \int_0^\infty h(z,u) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} (uz)^{\alpha_1-1} ((1-u)z)^{\alpha_2-1} e^{-\beta z} z du dz \\
&= \int_0^1 \int_0^\infty h(z,u) \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1+ \alpha_2)} z^{\alpha_1 + \alpha_2 -1} e^{-\beta z} \, dz \, du
\end{aligned}
and this proves all three statements at once.
Remark This is a nice, indirect way to compute the normalization for the density of the beta distribution, which is proportional to u^{\alpha_1-1}(1-u)^{\alpha_2-1}.
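A Monte Carlo sketch of these facts (Python; the parameters \alpha_1=2, \alpha_2=3, \beta=1 are arbitrary choices, and we only check means and an empirical correlation, which independence forces to vanish):

```python
import random

# For independent X1 ~ Gamma(a1, b) and X2 ~ Gamma(a2, b), the ratio
# U = X1/(X1+X2) has the beta mean a1/(a1+a2), the sum Z = X1+X2 has the
# gamma mean (a1+a2)/b, and U and Z are uncorrelated (a necessary
# consequence of their independence).
random.seed(3)
a1, a2 = 2.0, 3.0
N = 200_000

us, zs = [], []
for _ in range(N):
    x1 = random.gammavariate(a1, 1.0)  # shape a1, scale 1 (rate b = 1)
    x2 = random.gammavariate(a2, 1.0)
    us.append(x1 / (x1 + x2))
    zs.append(x1 + x2)

mean_u = sum(us) / N
mean_z = sum(zs) / N
cov = sum(u * z for u, z in zip(us, zs)) / N - mean_u * mean_z

assert abs(mean_u - a1 / (a1 + a2)) < 0.01
assert abs(mean_z - (a1 + a2)) < 0.05
assert abs(cov) < 0.02
```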
9.6 The beta-binomial model and a whiff of Bayesian statistics
A beta random variable U has mean
E[U] = \int_0^1 u \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} du = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1 +1) \Gamma(\alpha_2)}{\Gamma(\alpha_1 +\alpha_2 +1)} = \frac{\alpha_1}{\alpha_1+\alpha_2}
and proceeding similarly one finds that
{\rm Var}(U)= \frac{\alpha_1 \alpha_2}{(\alpha_1+\alpha_2)^2 (\alpha_1 + \alpha_2+1)}
There is a natural connection between a binomial random variable (discrete) and the beta random variable (continuous) (let us call it P for a change). The pdfs look strangely similar:
{n \choose j} p^j (1-p)^{n-j} \quad \text{ versus } \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 -1}
But p is a parameter for the binomial while it is the variable for the beta!
Model:
Make n independent trials, each with a (random) probability P.
Take P to have a beta distribution with suitably chosen parameters \alpha_1,\alpha_2.
The mean \frac{\alpha_1}{\alpha_1+\alpha_2} is your average guess for the “true” probability p and by adjusting the scale you can adjust the variance (uncertainty) associated with your guess.
This leads to considering a random variable (X,P) taking value in \{0,1,\cdots,n\} \times [0,1] with a “density”
f(j,p) = \underbrace{{n \choose j} p^j (1-p)^{n-j}}_{=k(p,j) \text{ kernel }} \underbrace{\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1-1} (1-p)^{\alpha_2-1} }_{=f(p)}
which is normalized: \sum_{j=0}^n \int_0^1 f(j,p) dp =1.
The beta-binomial distribution with parameters (n,\alpha_1,\alpha_2) is the marginal distribution of X on \{0,1,\cdots,n\} given by
\begin{aligned}
f(j) = \int_0^1 f(j,p) dp & = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \int_0^1 p^{\alpha_1+j-1}(1-p)^{\alpha_2+n-j-1} dp \\
& = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1+j) \Gamma(\alpha_2+n-j)}{\Gamma(\alpha_1+\alpha_2+n)}
\end{aligned}
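The marginal f(j) can be evaluated stably with log-gamma functions. A Python sketch (illustrative; for \alpha_1=\alpha_2=1 the beta-binomial is uniform on \{0,\ldots,n\}, which gives a second check):

```python
import math

# The beta-binomial pmf written with math.lgamma for numerical stability.
def beta_binomial_pmf(j, n, a1, a2):
    return math.exp(
        math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
        + math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
        + math.lgamma(a1 + j) + math.lgamma(a2 + n - j)
        - math.lgamma(a1 + a2 + n)
    )

n = 10
total = sum(beta_binomial_pmf(j, n, 2.5, 4.0) for j in range(n + 1))
assert abs(total - 1.0) < 1e-9

# for a1 = a2 = 1 (uniform prior) the beta-binomial is uniform on {0,...,n}
uniform = [beta_binomial_pmf(j, n, 1.0, 1.0) for j in range(n + 1)]
assert all(abs(p - 1 / (n + 1)) < 1e-9 for p in uniform)
```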
Bayesian statistics framework
We interpret the distribution f(p) with parameters \alpha_1 and \alpha_2 as the prior distribution which describes our beliefs before we do any experiment. For example \alpha_1=\alpha_2=1 corresponds to a uniform distribution on p (that is, we are totally agnostic, a fine choice if you know nothing).
The beta-binomial, which is the marginal f(j)=\int_0^1 f(j,p) dp, describes the distribution of the number of successes under this model. It is called the evidence.
The kernel k(p,j) describes the distribution of the number of successes, j, given a certain probability of success, p. It is called the likelihood when we view it as a function of p and think of j as a parameter.
Now we can write the distribution using kernels in two ways:
f(j,p)= k(p,j)f(p) = k(j,p) f(j)
and the kernel k(j,p) is called the posterior distribution. It is interpreted as the distribution of p given that j successes have occurred.
We can rewrite this (this is just a version of Bayes theorem) as
\text{posterior} = k(j,p) = \frac{k(p,j)f(p)}{f(j)} = \frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\,.
For the particular model at hand
k(j,p) \propto p^{j} (1-p)^{n-j} p^{\alpha_1-1} (1-p)^{\alpha_2-1} \propto p^{\alpha_1+j-1} (1-p)^{\alpha_2+ n- j-1}
and therefore k(j,p) is a beta distribution with parameters \alpha_1 +j and \alpha_2+n-j.
This is special: the prior and posterior distributions belong to the same family. We say that we have conjugate priors, and this is a simple model to play with. In general we need Markov chain Monte Carlo methods to do the job.
Example: Amazon seller ratings You want to buy a book online
Vendor 1 has 151 positive ratings and 7 negative ratings (95.5%).
Vendor 2 has 946 positive ratings and 52 negative ratings (94.7%).
A uniform prior with \alpha_1=\alpha_2=1 gives two beta posteriors: \alpha_1=152, \alpha_2=8 for vendor 1 and \alpha_1=947, \alpha_2=53 for vendor 2.
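Assuming the uniform prior above, the two vendors can be compared by sampling their posteriors (a Python sketch; the Monte Carlo estimate of P(p_1 > p_2) is illustrative and we only assert sanity checks on it):

```python
import random

# With the uniform prior, vendor 1's posterior is Beta(152, 8) and
# vendor 2's is Beta(947, 53); we compare them by posterior sampling.
random.seed(4)
N = 100_000
s1 = [random.betavariate(152, 8) for _ in range(N)]
s2 = [random.betavariate(947, 53) for _ in range(N)]

# posterior means match 152/160 = 0.95 and 947/1000 = 0.947
assert abs(sum(s1) / N - 152 / 160) < 0.005
assert abs(sum(s2) / N - 947 / 1000) < 0.005

# Monte Carlo estimate of P(p1 > p2); we only assert it is a probability
prob = sum(a > b for a, b in zip(s1, s2)) / N
assert 0.0 < prob < 1.0
```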
9.7 Exercises
Exercise 9.1
Suppose that X is a gamma RV with parameters \alpha and \beta and Z is an exponential RV with parameter 1, and suppose further that Z and X are independent. Find the CDF (and then the PDF) of Y=\frac{Z}{X}.
Proceeding as in Section 9.4 consider an exponential RV Y whose parameter X has a gamma distribution with parameters \alpha and \beta. Find the marginal distribution of Y.
Compare 1. and 2. and explain why they give the same result.
Exercise 9.2 (Poisson-Gamma model) Consider a Poisson RV X with a random parameter \Lambda which itself has a gamma distribution (for some parameters (\alpha,\beta).)
What are the joint density and the kernel, f(j,\lambda)=k(\lambda,j) f(\lambda), for the pair (X,\Lambda).
What is the density f(j) of X? (the “evidence”)?
What is the “posterior distribution” (that is what is the kernel k(j,\lambda) if we write f(j,\lambda)= k(j,\lambda)f(j))?
10 The characteristic and moment generating function for a random variable
One of the oldest tricks in the mathematician's toolbox is to obtain properties of a mathematical object by performing a transformation that maps it into another space. In analysis (say for ODEs and PDEs) the Fourier transform and the Laplace transform play a very important role. Both play an equally important role in probability theory!
The Fourier transform of a probability measure leads to a proof of the central limit theorem!
The Laplace transform (via Chernoff bounds) leads to concentration inequalities and performance guarantees for Monte-Carlo methods and statistical learning.
10.1 Fourier transform and characteristic function
Notation For vectors x,y \in \mathbb{R}^n we use the notation
\langle x, y\rangle = \sum_{i=1}^n x_i y_i
for the scalar product in \mathbb{R}^n
Definition 10.1 (Fourier transform and characteristic function)
For a probability measure P on \mathbb{R}^n the Fourier transform of P is a function \widehat{P}(t): \mathbb{R}^n \to \mathbb{C} given by
\widehat{P}(t) = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP(x)
For a random variable X taking values in \mathbb{R}^n the characteristic function of X is the Fourier transform of P^X (the distribution of X): we have
\phi_X(t)= E\left[e^{i\langle t,X \rangle}\right] = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP^X(x)
Remarks
We have not talked explicitly about integration of complex valued functions h=f+ig where f and g are the real and imaginary parts. It is simply defined as
\int (f+ig)dP = \int f dP + i \int g dP
provided f and g are integrable. A complex function h is integrable if and only if |h| is integrable, if and only if f and g are integrable. (The only thing a bit hard to prove is the triangle inequality |\int h dP|\le \int|h| dP.)
The function
e^{i\langle t,x \rangle} = \cos(\langle t,x \rangle) + i \sin (\langle t,x \rangle)
is integrable since \sin(\langle t,x \rangle) and \cos(\langle t,x \rangle) are bounded functions (thus in L^\infty \subset L^1) or by noting that |e^{i\langle t,x \rangle}|=1.
Suppose the measure P has a density f(x) then we have
\widehat{P}(t)= \int e^{i\langle t,x \rangle} f(x) dx
which is simply the Fourier transform (usually denoted by \widehat{f}(t)) of the function f(x). (Diverse conventions are used for the Fourier transform, e.g. using e^{-i 2\pi\langle k,x \rangle} instead of e^{i\langle t,x \rangle}.)
10.2 Analytic properties of the Fourier transform
We turn next to the properties of the Fourier transform. A very useful thing to remember about the Fourier transform:
\text{ the smoother the Fourier transform is, the faster the function (or the measure) decays, and vice versa}
The next two theorems make this explicit in the context of measures. The first one is very general and simply uses the fact that we are dealing with probability measures.
Theorem 10.1 The Fourier transform \widehat{P}(t) of a probability measure P is uniformly continuous on \mathbb{R}, and satisfies \widehat{P}(0)=1 and |\widehat{P}(t)|\le 1.
Proof. Clearly \widehat{P}(0)= \int_{\mathbb{R}^n} dP(x)=1 since P is a probability measure and, since |e^{i\langle t,x \rangle}|=1, by the triangle inequality |\widehat{P}(t)|\le 1.
For the uniform continuity we have
\begin{aligned}
|\widehat{P}(t+h) -\widehat{P}(t)| \le \int |e^{i\langle (t+h),x \rangle} - e^{i\langle t,x \rangle} | dP(x) =
\int |e^{i\langle t,x \rangle}| | e^{i\langle h,x \rangle} - 1 | dP(x) = \int | e^{i\langle h,x \rangle} - 1 | dP(x)
\end{aligned}
The right-hand side is independent of t, which gives uniformity. To conclude we need to show that the right-hand side goes to 0 as h \to 0. We can use dominated convergence since
\lim_{h \to 0} e^{i\langle h,x \rangle} =1 \text{ for all } x \quad \quad \text{ and } \quad \quad | e^{i\langle h,x \rangle} - 1 | \le 2 \quad \square
As we have seen in Chapter 7 the L^p spaces form a decreasing sequence L^1 \supset L^2 \supset L^3 \supset \cdots \supset L^\infty and the next theorem shows that if a random variable belongs to L^m (for some integer m) then its characteristic function will be m-times continuously differentiable.
Theorem 10.2 Suppose X is a RV taking values in \mathbb{R}^n and is such that E[|X|^m] < \infty. Then the characteristic function \phi_X(t)=E[e^{i\langle t, X \rangle}] has continuous partial derivatives up to order m, and for any k \le m,
\frac{\partial^k\phi_X}{ \partial{x_{i_1}}\cdots \partial{x_{i_k}} }(t) = i^k E\left[X_{i_1} \cdots X_{i_k} e^{i\langle t, X \rangle}\right]
Proof. We will prove only the m=1 case, the rest is proved by a tedious induction argument. Denoting by e_i the i-th standard basis vector,
\begin{aligned}
\frac{\partial\phi_X}{ \partial x_i}(t) = \lim_{h \to 0} \frac{\phi_X(t+he_i)- \phi_X(t)}{h} = \lim_{h \to 0} E \left[ \frac{1}{h}\left(e^{i\langle t+ he_i, X \rangle} - e^{i\langle t, X \rangle}\right) \right] = \lim_{h \to 0}
E \left[ e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h} \right]
\end{aligned}
To exchange the limit and expectation we use DCT argument and the bound
|e^{i\alpha} -1| = \left| \int_0^\alpha \frac{d}{ds} e^{is} \, ds \right| \le \left|\int_0^\alpha |i e^{is}| \, ds\right| = |\alpha|\,.
From this we see that \left|e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h}\right| \le |X_i| which is integrable and independent of h. The DCT concludes the proof. \quad \square
10.3 More properties
Two simple but extremely useful properties:
Theorem 10.3 If X takes value in \mathbb{R}^n, b \in \mathbb{R}^m and A is a m\times n matrix then
\phi_{AX+b}(t) = e^{i \langle t, b\rangle}\phi_{X}(A^Tt)
Proof. This simply follows from the equality
e^{i \langle t, AX +b \rangle} = e^{i \langle t,b \rangle} e^{i \langle A^T t,X \rangle}\,. \quad \square
Theorem 10.4 Suppose X and Y are independent RV taking values in \mathbb{R}^n then
\phi_{X+Y}(t)=\phi_{X}(t) \phi_Y(t)
Proof. By independence
E\left[ e^{i \langle t,X+Y \rangle} \right] =E\left[ e^{i \langle t, X \rangle} e^{i \langle t, Y \rangle}\right] =E\left[ e^{i \langle t, X \rangle}\right]E\left[ e^{i \langle t,Y \rangle}\right] \quad \square
10.4 Examples
Normal with parameters \mu, \sigma^2: We start with the special case \mu=0 and \sigma^2=1 and we need to compute the complex integral
\phi_X(t)=E[e^{itX}]= \frac{1}{\sqrt{2\pi}}\int e^{itx} e^{-x^2/2} dx
You can do it via a contour integral and the residue theorem (complete the square!). Instead we use an ODE argument.
First we note that by symmetry
\phi_X(t)= \frac{1}{\sqrt{2\pi}}\int \cos(tx) e^{-x^2/2} dx
By Theorem 10.2 we can differentiate under the integral and find, after integrating by parts,
\phi_X'(t)=\frac{1}{\sqrt{2\pi}}\int -x \sin(tx)e^{-x^2/2} dx = \frac{1}{\sqrt{2\pi}}\int -t \cos(tx)e^{-x^2/2} dx =-t \phi_X(t)
a separable ODE with initial condition \phi_X(0)=1. The solution is easily found to be e^{-t^2/2} and thus we have
\phi_X(t)=E\left[e^{itX}\right]= e^{-t^2/2}.
Finally, noting that Y=\sigma X +\mu is normal with mean \mu and variance \sigma^2, we find by Theorem 10.3 that
\phi_Y(t)=e^{i\mu t}E\left[e^{i\sigma t X}\right]= e^{i\mu t -\sigma^2t^2/2}.
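The closed form e^{i\mu t -\sigma^2t^2/2} is easy to check numerically. The following sketch (assuming numpy is available; the parameters, seed and sample size are arbitrary choices, not from the text) compares a Monte-Carlo estimate of E[e^{itX}] with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cf(samples, t):
    """Monte Carlo estimate of the characteristic function E[exp(i t X)]."""
    return np.mean(np.exp(1j * t * samples))

# arbitrary parameters for the sketch
mu, sigma = 1.0, 2.0
samples = rng.normal(mu, sigma, size=200_000)

for t in [-1.0, 0.5, 2.0]:
    exact = np.exp(1j * mu * t - sigma**2 * t**2 / 2)
    assert abs(empirical_cf(samples, t) - exact) < 0.02
```

Since |e^{itX}|=1 the Monte-Carlo average has bounded variance, so the agreement improves like 1/\sqrt{N} regardless of t.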
Exponential with parameter \beta: For any (complex) z we have \int_a^b e^{z x} dx = \frac{e^{z b} - e^{za}}{z}. From this we deduce that
\phi_X(t) = \beta \int_0^\infty e^{(it-\beta)x} dx = \frac{\beta}{\beta-it}
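As a sanity check on \phi_X(t)=\frac{\beta}{\beta-it}, one can approximate \beta \int_0^\infty e^{(it-\beta)x} dx on a truncated grid (the truncation point, step size, and parameter values below are arbitrary choices):

```python
import numpy as np

beta, t = 1.5, 2.0
x = np.linspace(0.0, 40.0, 400_001)           # truncation of [0, infinity); tail ~ e^{-60}
dx = x[1] - x[0]
integrand = beta * np.exp((1j * t - beta) * x)

# composite trapezoidal rule for the (complex-valued) integral
numeric = np.sum((integrand[:-1] + integrand[1:]) / 2) * dx
exact = beta / (beta - 1j * t)
assert abs(numeric - exact) < 1e-6
```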
10.5 Uniqueness Theorem
We show now that the Fourier transform determines the probability measures uniquely, that is if two probability measures P and Q have the same Fourier transforms \widehat{P}=\widehat{Q} then they must coincide P=Q. For simplicity we only consider the 1-d case but the proof extends without problem.
There exist several versions of this proof (see your textbook for one such proof). We give here a direct proof which also gives an explicit formula on how to reconstruct the measure from its Fourier transform.
Our proof relies on the following computation of the so-called Dirichlet integral
Lemma 10.1 (Dirichlet integral) For T > 0 let S(T)= \int_0^T \frac{\sin{t}}{t} \, dt. We have then
\lim_{T \to \infty} S(T) = \frac{\pi}{2}
This is a fun integral to do and can be done using a contour integral in the complex plane, or by a Laplace transform trick, or by the so-called Feynman trick (add a parameter and differentiate). See for example the Wikipedia page.
We will take this result for granted and note that we have
\int_0^T \frac{\sin(\theta t)}{t} dt = {\rm sgn}(\theta) S( |\theta| T)
where {\rm sgn}(\theta) is +1, 0 or -1 if \theta is positive, 0, or negative.
Theorem 10.5 (Fourier inversion formula) If a and b are not atoms for P we have
P((a,b])= \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it} \widehat{P}(t) \, dt
\tag{10.1} In particular distinct probability measures cannot have the same characteristic function.
Proof. The inversion formula implies uniqueness: the collection of intervals (a,b] such that a and b are not atoms is a \pi-system which generates \mathcal{B}, so the monotone class theorem implies the result, see Theorem 3.2. (See the exercises for more on atoms.)
Let I_T denote the integral in Equation 10.1. Using the bound |e^{iz}-e^{iz'}|\le |z-z'| we see that the integrand is bounded and thus by Fubini’s theorem we have
I_T = \frac{1}{2\pi} \int_{-\infty}^\infty \left( \int_{-T}^T \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \right) dP(x)
Using Euler's formula and the fact that \cos is even and \sin is odd we find
\begin{aligned}
I_T &= \int_{-\infty}^\infty \left( \int_{0}^T \frac{ \sin(t(x-a)) -\sin(t(x-b)) }{\pi t} \, dt \right) dP(x) \\
&= \int_{-\infty}^\infty \frac{{\rm sgn}(x-a)}{\pi} S(T|x-a|)-\frac{{\rm sgn}(x-b)}{\pi} S(T|x-b|) dP(x)
\end{aligned}
\tag{10.2}
The integrand in Equation 10.2 is bounded and converges as T \to \infty to the function
\psi_{a,b}(x)=
\left\{
\begin{array}{cl}
0 & x < a \\
\frac{1}{2} & x=a \\
1 & a < x < b \\
\frac{1}{2} & x=b \\
0 & x >b \\
\end{array}
\right.
By DCT we have that I_T \to \int \psi_{a,b} dP = P((a,b]) if a and b are not atoms. \quad \square
You can use the Fourier inversion formula to extract more information.
Theorem 10.6 Suppose the Fourier transform \widehat{P}(t) is integrable, \int |\widehat{P}(t)| dt < \infty. Then P has a density f(x)=\frac{1}{2\pi}\int e^{-itx} \widehat{P}(t) dt.
Proof. Using that |\frac{e^{-ita} - e^{-itb}}{it}|\le |b-a|, the integrability of |\widehat{P}(t)| means we can extend the integral in Equation 10.1 to an integral from -\infty to \infty. As a consequence we get P((a,b)) \le \frac{|b-a|}{2\pi} \int_{-\infty}^\infty |\widehat{P}(t)| \, dt and thus P has no atoms. Furthermore for the CDF F(t) of P we have, for h negative or positive,
\frac{F(x+h) - F(x)}{h} =\frac{1}{2\pi}\int_{-\infty}^\infty \frac{e^{-itx} - e^{-it(x+h)}}{ith} \widehat{P}(t) dt
The integrand is dominated by |\widehat{P}(t)| and by DCT F is differentiable and P has density F'(x)= f(x)= \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx} \widehat{P}(t) dt. \quad \square
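The inversion formula of Theorem 10.6 can be tried numerically: starting from the characteristic function e^{-t^2/2} of a standard normal, a truncated trapezoidal approximation of \frac{1}{2\pi}\int e^{-itx}\widehat{P}(t)dt recovers the Gaussian density. (A sketch assuming numpy; truncating at |t|=12 is an arbitrary choice, harmless since the tail is of order e^{-72}.)

```python
import numpy as np

t = np.linspace(-12.0, 12.0, 200_001)         # truncated integration grid
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)                       # characteristic function of N(0,1)

def density(x):
    """Trapezoidal approximation of (1/2 pi) * integral of exp(-itx) phi(t) dt."""
    vals = np.exp(-1j * t * x) * phi
    integral = np.sum((vals[:-1] + vals[1:]) / 2) * dt
    return integral.real / (2 * np.pi)

for x in [0.0, 1.0, -2.5]:
    exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    assert abs(density(x) - exact) < 1e-6
```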
10.6 Examples
We can use the inversion theorem in creative ways.
Examples:
The Laplace RV Y is a two-sided version of the exponential RV. Its density is f(x) = \frac{\beta}{2}e^{-\beta|x|}. You can think of the Laplace distribution as the mixture (with mixture parameters \frac12, \frac12) of an exponential RV X and -X where X is exponential. Its characteristic function is then
\phi_Y(t)= E[e^{itY}]= \frac{1}{2}E[e^{itX}] + \frac{1}{2}E[e^{-itX}] = \frac{1}{2} \frac{\beta}{\beta -it} + \frac{1}{2}\frac{\beta}{\beta +it} =\frac{\beta^2}{\beta^2 + t^2}
The Cauchy RV Z has density f(x)=\frac{\beta}{\pi(x^2 + \beta^2)} and its characteristic function is given by
\phi_Z(t)= E[e^{itZ}]= \int_{-\infty}^\infty e^{itx}\frac{\beta}{\pi(x^2 + \beta^2)} dx
a priori not an easy integral. However notice that the Fourier transform of the Laplace looks (up to constants) exactly like the density of a Cauchy! So using Theorem 10.6 for the Laplace distribution shows that
\frac{\beta}{2} e^{-\beta|x|} = \frac{1}{2\pi}\int e^{-itx} \phi_Y(t) dt = \frac{1}{2\pi}\int e^{-itx} \frac{\beta^2}{\beta^2 + t^2}dt = \frac{\beta}{2} \int e^{itx} \frac{\beta}{\pi(\beta^2 + t^2)}dt =\frac{\beta}{2} \phi_Z(x)
from which we conclude that \phi_Z(t)= e^{-\beta|t|}.
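A quick Monte-Carlo check of \phi_Z(t)=e^{-\beta|t|} for \beta=1 (seed and sample size are arbitrary; the sketch assumes numpy). Note that the empirical mean of e^{itZ} is perfectly well behaved even though Z itself has no mean:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_cauchy(size=400_000)         # Cauchy with beta = 1

# compare the empirical characteristic function with exp(-|t|)
for t in [0.5, 1.0, 3.0]:
    approx = np.mean(np.exp(1j * t * z))
    assert abs(approx - np.exp(-abs(t))) < 0.02
```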
10.7 Sum of independent random variables
Suppose X and Y are independent random variables. We wish to understand what is the distribution of X+Y. The first tool is to use the characteristic function and the fact that if X and Y are independent
E[ e^{it(X+Y)}] = E[e^{itX}] E[e^{itY}]
together with the uniqueness theorem.
Examples
Suppose X is normal with parameters \mu and \sigma^2 and Y is normal with parameters \nu and \eta^2. Then if X and Y are independent, X+Y is normal with parameters \mu+\nu and \sigma^2 + \eta^2. This follows from the uniqueness theorem and
E[e^{it(X+Y)} ] = E[e^{itX}] E[e^{itY}] = e^{i\mu t - \sigma^2t^2/2} e^{i\nu t - \eta^2t^2/2} = e^{i(\mu+\nu) t - (\sigma^2+\eta^2) t^2/2}
Suppose X_1, \cdots, X_n are independent Bernoulli RVs with parameter p; then X_1+ \cdots + X_n is a binomial RV. Indeed we have
E[e^{it(X_1+\cdots+X_n)} ]= E[e^{itX_1}] \cdots E[e^{itX_n}] =(e^{it}p +(1-p))^n
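The identity (e^{it}p +(1-p))^n can be checked directly against the characteristic function computed from the binomial pmf (a small sketch assuming numpy; the parameter values are arbitrary):

```python
import numpy as np
from math import comb

n, p, t = 10, 0.3, 0.7                            # arbitrary choices
ks = np.arange(n + 1)
pmf = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in ks])

cf_binomial = np.sum(pmf * np.exp(1j * t * ks))   # E[e^{itS}] from the binomial pmf
cf_product = (p * np.exp(1j * t) + (1 - p))**n    # product of n Bernoulli cfs
assert abs(cf_binomial - cf_product) < 1e-12
```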
Another tool is the following convolution theorem
Theorem 10.7 (Convolution of probability measures) Assume X and Y are independent random variables.
If X and Y have distribution P^X and P^Y then X+Y has the distribution
P^X \star P^Y (A) = \int \int 1_A(x+y) dP^X(x) dP^Y(y) \quad \text{ convolution product}
If X and Y have densities f_X(x) and f_Y(y) then X+Y has the density
f_{X+Y}(z)= \int f_X(z-y) f_Y(y) dy = \int f_X(x) f_Y(z-x) dx
Proof. For the first part let us take a non-negative function h and set Z=X+Y. We have then
E[h(Z)]= E[h(X+Y)]= \int h(x+y) P^X(dx) P^Y(dy)
Taking h=1_A gives the result.
For the second part if X and Y have a density we have
\begin{aligned}
P(Z \in A) =E[1_A(Z)] & = \int \int 1_A(x+y) f_X(x) f_Y(y) dx dy \\
& = \int \int 1_A(z) f_X(z-y) f_Y(y) dz dy \quad \text{ change of variables } z=x+y, dz = dx \\
&= \int \left(\int f_X(z-y) f_Y(y) dy\right) 1_A(z) dz \quad \text{ Fubini }
\end{aligned}
and since this holds for all A, Z has the claimed density. The second formula is proved in the same way. \quad \square.
Example: triangular distribution
Suppose X and Y are independent and uniformly distributed on [-\frac{1}{2},\frac12]. Then Z=X+Y takes values between -1 and +1 and for z \in [-1,1] we have
f_Z(z)=
\int_{-\infty}^\infty f_X(z-y) f_Y(y) dy =
\left\{
\begin{array}{cl}
\int_{-\frac12}^{z+ \frac12} dy & -1\le z \le 0 \\
\int_{z-\frac{1}{2}}^{\frac12} dy & 0 \le z \le 1
\end{array}
\right.
= 1- |z|
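A simulation sketch (assuming numpy; seed and sample size are arbitrary) that checks the triangular density 1-|z|, by comparing the empirical CDF of a sum of two independent uniforms with the CDF obtained by integrating 1-|z|:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
z = rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)

def triangular_cdf(z0):
    """CDF obtained by integrating the density 1 - |z| from -1 to z0 (z0 in [-1, 1])."""
    if z0 <= 0:
        return (1 + z0)**2 / 2
    return 1 - (1 - z0)**2 / 2

for z0 in [-0.5, 0.0, 0.25, 0.75]:
    assert abs(np.mean(z <= z0) - triangular_cdf(z0)) < 0.005
```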
10.8 Moment and moment generating functions
The moment problem is the question whether a probability distribution P is uniquely determined by all its moments E[X^n]. It is in general not true, as the following example shows.
Recall that the log-normal distribution is the distribution of e^X where X is a normal RV. Its pdf is given, for \mu=0 and \sigma^2=1, by
f(x)= \frac{1}{x \sqrt{2 \pi}} e^{- \frac{\ln(x)^2}{2}}
and all moments exist: E[X^r]=\int_0^\infty x^r f(x) dx = e^{r^2/2}.
Now consider
g(x) = f(x)(1 + \sin(2 \pi \ln(x)))
Then for k=0,1,2, \cdots we have with the change of variable \ln(x)= s+k
\int_0^\infty x^k f(x) \sin(2 \pi \ln(x)) dx = \frac{1}{\sqrt{2 \pi}} e^{k^2/2}\int_{-\infty}^\infty e^{-s^2/2} \sin(2 \pi s) ds =0 \,.
This shows that g is the density of a RV Y and that all moments of Y coincide with the moments of the log-normal!
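One can also verify numerically that the perturbation changes no moment: substituting \ln(x)=s, the k-th moment of the extra term f(x)\sin(2\pi \ln(x)) equals \frac{1}{\sqrt{2\pi}}\int e^{ks} e^{-s^2/2}\sin(2\pi s)\,ds, which vanishes by the shift argument above. A sketch assuming numpy (the grid choices are arbitrary):

```python
import numpy as np

s = np.linspace(-20.0, 20.0, 2_000_001)       # s = ln(x); truncation is harmless here
ds = s[1] - s[0]
gauss = np.exp(-s**2 / 2) / np.sqrt(2 * np.pi)

for k in range(4):
    # k-th moment of the perturbation f(x) sin(2 pi ln x), after substituting ln(x) = s
    vals = np.exp(k * s) * gauss * np.sin(2 * np.pi * s)
    integral = np.sum((vals[:-1] + vals[1:]) / 2) * ds
    assert abs(integral) < 1e-5
```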
A stronger condition on the moments does imply uniqueness: if all moments exist and E[X^n] does not grow too fast with n then the moments do determine the distribution. This uses an analytic continuation argument and relies on the uniqueness theorem for the Fourier transform.
Theorem 10.8 Suppose X and Y are RVs such that the moment generating functions M_X(t) and M_Y(t) coincide on [-t_0,t_0] and are finite in that interval. Then X and Y have the same distribution.
Proof.
Since e^{t|x|} \le e^{tx}+ e^{-tx} and the right-hand side is integrable, the function e^{t|X|}=\sum_{k=0}^\infty \frac{t^k|X|^k}{k!} is integrable. By the DCT (for sums of RVs, in the form of Exercise 7.2) we have E[e^{tX}]=\sum_{k=0}^\infty \frac{t^k E[X^k]}{k!}.
This implies that \frac{t^k E[X^k]}{k!} \to 0 as k \to \infty (for |t|\le t_0). We claim that this implies that \frac{s^k E[|X|^k]}{k!}\to 0 as k \to \infty as long as s < t_0. If k is even E[|X|^k]=E[X^k] and there is nothing to do. For k odd we use on one hand that |X|^{2k-1} \le 1 + |X|^{2k} and on the other hand that for s< t we have (for k sufficiently large) 2k s^{2k-1} < t^{2k}. Together this shows
\frac{s^{2k-1} E[|X|^{2k-1}]}{(2k-1)!} \le \frac{s^{2k-1}}{(2k-1)!} + \frac{t^{2k} E[X^{2k}]}{(2k)!} \quad \text{ for $k$ large enough.}
The next piece is the Taylor expansion theorem with remainder for functions which are (n+1)-times continuously differentiable
f(x)= \sum_{k=0}^n f^{(k)}(x_0) \frac{(x-x_0)^k}{k!} + \int_{x_0}^x \frac{f^{(n+1)}(s)}{n!}(x-s)^n ds
from which we obtain
\left|e^{itx}\left( e^{ihx} - \sum_{k=0}^n \frac{(ihx)^k}{k!}\right)\right| \le \frac{|h x|^{n+1}}{(n+1)!}
Integrating with respect to P and taking n \to \infty together with Theorem 10.2 gives
\phi_X(t+h)= \sum_{k=0}^\infty \phi_X^{(k)}(t) \frac{h^k}{k!}
\tag{10.3} for |h| < t_0.
Now suppose X and Y have the same moment generating function in a neighborhood of 0. Then all their moments coincide and thus by Equation 10.3 (with t=0), \phi_X(t)=\phi_Y(t) on the interval (-t_0, t_0). By Theorem 10.2 their derivatives must also be equal on (-t_0,t_0). Using now Equation 10.3 (with t=-t_0+\epsilon and t=t_0-\epsilon) shows that \phi_X(t)=\phi_Y(t) on the interval (-2t_0, 2t_0). Repeating this argument, \phi_X(t)=\phi_Y(t) for all t and thus by Theorem 10.5, X and Y must have the same distribution. \quad \square
10.9 Exercises
Exercise 10.1
Show that a characteristic function \phi_X(t) satisfies \overline{\phi_X(t)}=\phi_X(-t) (complex conjugate).
Show that a characteristic function \phi_X(t) is real if and only if the random variable X is symmetric (i.e. X and -X have the same distribution).
Show that if \phi_X is the characteristic function for some RV X then \phi_X^2(t) and |\phi_X(t)|^2 are characteristic functions as well. What are the corresponding RVs?
Exercise 10.2 In this problem we study the characteristic function for a Gamma random variable with parameters \alpha and \beta and density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}. In particular you will prove that \phi_X(t)=\left(\frac{\beta}{\beta-it}\right)^\alpha.
First show that it is enough to consider the case \beta=1 (change of scale).
Use the moments E[X^n] to show that \phi_X(t)= \sum_{n=0}^\infty (it)^n \frac{\Gamma(\alpha+n)}{\Gamma(\alpha) n!} and use then the binomial series for (1-it)^{-\alpha}.
Use your result to show that
The sum of two independent gamma random variables with parameters (\alpha_1,\beta) and (\alpha_2,\beta) is a gamma random variable.
If X_1, X_2, \cdots, X_n are independent normal random variables with mean 0 and variance \sigma^2, show that X_1^2 + \cdots + X_n^2 is a gamma random variable and find the parameters.
Exercise 10.3 Show that if X and Y are RVs taking values in the non-negative integers with distributions P^X(n) and P^Y(n) and are independent, then X+Y has distribution P^{X+Y}(n)=\sum_{k=0}^n P^X(k) P^Y(n-k) (this is called the convolution product of the two sequences P^X and P^Y).
Exercise 10.4
In Theorem 10.5, modify the statement of the theorem if a or b are atoms.
Show that
P(\{a\}) = \lim_{T \to \infty} \frac{1}{2T}\int_{-T}^T e^{-ita} \widehat{P}(t) dt
Hint: Imitate the proof of the inversion formula in Theorem 10.5
Suppose a RV X takes integer values in \mathbb{Z}, show that
P(X=n)= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-itn} \phi_X(t) dt
Hint: Show that \phi_X(t) is periodic
11 Convergence of random variables
We think of a (real-valued) random variable as a function X(\omega) and so if we have a sequence of RVs \{ X_n\} we can define various types of convergence.
Almost sure convergence
Convergence in probability
Convergence in L^p
In subsequent chapters we will study another type of convergence, namely weak convergence (also called convergence in distribution). It is of a quite different type because it is based on the distribution of X and not on the notion of a random variable as a function.
11.1 Almost sure convergence
Definition 11.1 (Almost sure convergence) A sequence of RVs \{X_n\} converges almost surely to a RV X if
\lim_{n} X_n(\omega) = X(\omega) \text{ a.s}
that is P(\{\omega\,:\, \lim_{n} X_n(\omega) = X(\omega)\})=1.
Almost sure convergence is pointwise convergence; the limit is unique if, of course, we identify RVs which are equal a.s.
It will be very useful to rephrase almost sure convergence in a different way. At first sight it looks a bit strange but it is a good idea. We explain the idea first for a sequence \{x_n\} of numbers. Consider the function defined for any \epsilon>0
i_\epsilon(x) = 1_{(\epsilon,\infty)}(x) = \left\{
\begin{array}{cl}
1 & x > \epsilon \\
0 & x \le \epsilon
\end{array}
\right.
Lemma 11.1
A sequence \{x_n\} converges to x if and only if \sum_n i_\epsilon(|x_n-x|) < \infty for every \epsilon>0.
If there is a non-negative sequence \epsilon_n such that \sum_{n}\epsilon_n <\infty and \sum_{n} i_{\epsilon_n}(|x_n-x_{n+1}|) < \infty then x_n converges to some x.
Proof.
Fix \epsilon>0. If x_n \to x then there exists N such that for any n \ge N, |x_n-x|\le \epsilon which means i_\epsilon(|x_n-x|) =0 for n \ge N and thus \sum_n i_\epsilon(|x_n-x|) < \infty. Conversely if \sum_n i_\epsilon(|x_n-x|) < \infty then, since the terms are either 0 or 1, only finitely many terms can be nonzero and thus there exists N such that |x_n-x| \le \epsilon for n \ge N. \quad \square
If this holds there exists N such that |x_n-x_{n+1}|\le \epsilon_n for n \ge N. Taking n>m \ge N gives
|x_n-x_m| \le |x_n-x_{n-1}|+ \cdots + |x_{m+1}-x_m| \le \epsilon_m + \cdots + \epsilon_n \le \sum_{j=m}^\infty \epsilon_j
Since the sequence \epsilon_n is summable, \sum_{j=m}^\infty \epsilon_j goes to 0 as m \to \infty. Therefore the sequence x_n is a Cauchy sequence and thus converges to some x. \quad \square
Returning to random variables we find
Theorem 11.1 The sequence of RVs X_n converges almost surely to X if and only if, for every \epsilon>0,
\sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \quad \text{ almost surely}
\tag{11.1}
Proof.
Let \Omega_0 be the set on which convergence holds; then the sum in Equation 11.1 converges for \omega \in \Omega_0 (which is independent of \epsilon).
For the converse, the only small issue to deal with is that the set on which the sum converges may depend on \epsilon. So pick a sequence \epsilon_k \searrow 0 and let N_k be the sum in Equation 11.1 with \epsilon = \epsilon_k; we have P(N_k < \infty)=1. Since \epsilon_{k+1}\le \epsilon_k, i_{\epsilon_{k+1}}\ge i_{\epsilon_{k}} and so N_{k+1}\ge N_k. The events \{N_k < \infty\} are shrinking to
\Omega_0=\{ \omega\,:\, \sum_{n} i_{\epsilon}\circ|X_n-X| < \infty \text{ for all } \epsilon >0 \}
By sequential continuity P(\Omega_0)=\lim_{k} P(N_k < \infty)=1 and thus X_n converges to X almost surely. \quad \square.
11.2 Borel-Cantelli Lemmas
The Borel-Cantelli Lemma is very simple, and very useful
Lemma 11.2 (Borel Cantelli Lemma) Suppose B_n is a collection of events. Then
\sum_n P(B_n) < \infty \implies \sum_{n} 1_{B_n} < \infty \text{ almost surely}\,.
Proof. By monotone convergence (applied to sums) E[\sum_{n} 1_{B_n}]=\sum_{n} E[1_{B_n}] =\sum_n P(B_n) < \infty. This implies that \sum_{n} 1_{B_n}< \infty almost surely. \quad \square
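The lemma is easy to see in simulation: with independent events of probability 1/n^2 (summable) only a handful of events ever occur, while probabilities 1/n (not summable) keep producing occurrences, on the order of \ln N of them up to time N, in line with the partial converse below. A sketch assuming numpy; the seed and horizon are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = np.arange(1, 1_000_001)
u = rng.uniform(size=n.size)

# independent events B_n realized as {U_n < p_n}
hits_summable = int(np.sum(u < 1.0 / n**2))   # sum p_n = pi^2/6: finitely many hits expected
hits_divergent = int(np.sum(u < 1.0 / n))     # sum p_n diverges: about ln(10^6) ~ 14 hits

assert hits_summable < 50
assert hits_divergent >= 1
```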
There is a converse to the Borel-Cantelli Lemma. To a set B we can associate a Bernoulli RV X=1_{B} and vice versa. The Kolmogorov 0-1 law tells us that if the B_i (and so the X_i) are independent then either \sum_n {X_n}<\infty or \sum_n {X_n}=\infty almost surely. The Borel-Cantelli Lemma shows that (even without an independence assumption) the first case occurs when \sum_n P(B_n) < \infty. The next result shows a partial converse.
Lemma 11.3 (Second Borel-Cantelli Lemma) Suppose X_n are Bernoulli random variables.
If \sum_n E[X_n] < \infty then \sum_{n}X_n <\infty almost surely.
If \sum_n E[X_n] = \infty and the X_n are pairwise independent then \sum_{n} X_n =\infty almost surely.
Proof. The first part is just the Borel-Cantelli Lemma (Lemma 11.2). The proof of the second part is more subtle. We let S_n=X_1+ \cdots + X_n and S=\lim_{n}S_n. Using the pairwise independence we have
{\rm Var}(S_n) = \sum_{i=1}^n {\rm Var}(X_i) = \sum_i p_i (1-p_i) \le \sum_{i=1}^n p_i
where p_i=E[X_i].
We set further a_n = \sum_{i=1}^n p_i=\sum_{i=1}^n E[X_i]. Then by assumption a_n \nearrow \infty. This implies that for any b>0 we also have a_n - \sqrt{b a_n} \nearrow \infty and so \{ S \le a_n - \sqrt{b a_n} \} \nearrow \{S < \infty\}.
Since S_n \le S we have
\{ S \le a_n - \sqrt{b a_n} \} \subset \{ S_n \le a_n - \sqrt{b a_n} \} \subset \{ |S_n -a_n| >\sqrt{b a_n} \}
Hence, by sequential continuity and Chebyshev inequality,
\begin{aligned}
P( S < \infty) & = \lim_{n} P( S \le a_n - \sqrt{b a_n}) \le \limsup P(S_n \le a_n - \sqrt{b a_n}) \le \limsup P(|S_n -a_n| >\sqrt{b a_n}) \\
&\le \limsup_{n} \frac{{\rm Var(S_n)}}{b a_n} \le \frac{1}{b}
\end{aligned}
Since b is arbitrary, S=\infty almost surely. \quad \square
From the Borel-Cantelli Lemma we obtain useful criteria for almost sure convergence.
Theorem 11.2
If, for any \epsilon> 0, \sum_{n} P(|X_n-X|\ge \epsilon) < \infty then X_n \to X almost surely.
If there exists a sequence \epsilon_n \searrow 0 such that \sum_{n} P(|X_n-X|\ge \epsilon_n) < \infty, then X_n \to X almost surely.
If there exists a sequence \epsilon_n with \sum_{n}\epsilon_n <\infty and \sum_{n} P(|X_n-X_{n+1}|\ge \epsilon_n) < \infty, then X_n converges almost surely (to some X).
Proof.
By Borel-Cantelli we have \sum_{n} i_\epsilon(|X_n-X|) < \infty a.s. for every \epsilon>0, which by Theorem 11.1 means a.s. convergence.
By the Borel-Cantelli Lemma we have |X_n-X| \le \epsilon_n for all but finitely many n, almost surely. Since \epsilon_n \to 0 this means that X_n converges to X a.s.
By Lemma 11.1 and Borel-Cantelli, \{X_n\} is almost surely a Cauchy sequence and thus converges a.s. \quad \square
11.3 Convergence in Probability
Definition 11.2 (Convergence in probability) A sequence of RVs \{X_n\} converges in probability to a RV X if, for any \epsilon> 0,
\lim_{n \to \infty} P( \{\omega\,:\, |X_n(\omega) - X(\omega)|> \epsilon\}) =0 \,.
This is a very useful mode of convergence, in particular because it is weaker than almost sure convergence and thus easier to prove.
Example Let \Omega=[0,1) and P be the Lebesgue measure. Consider the sequence of RVs
X_1 = 1_{[0,\frac12)}, X_2 = 1_{[\frac12,1)}, X_3 = 1_{[0,\frac13)}, X_4 = 1_{[\frac13,\frac23)}, X_5 = 1_{[\frac23,1)}, X_6 = 1_{[0,\frac14)}, X_7 = 1_{[\frac14,\frac24)}, \cdots
\tag{11.2}
We claim that X_n converges to 0 in probability. Indeed, for any \epsilon>0, P( X_n > \epsilon ) =P(X_n=1) \to 0 since X_n=1_{I_n} is the indicator function of an interval I_n whose length goes to 0 as n goes to infinity.
The sequence X_n does not converge a.s. Indeed every \omega \in [0,1) belongs to infinitely many intervals of the form [\frac{k}{n},\frac{k+1}{n}) and also lies outside infinitely many such intervals. Therefore \liminf X_n(\omega)=0 \not= \limsup X_n(\omega)= 1.
Note also that the sequence X_n has (many!) subsequences which converge to 0 almost surely. To construct one, pick n_k such that the interval I_{n_k} for X_{n_k}=1_{I_{n_k}} is contained in the interval I_{n_{k-1}} for X_{n_{k-1}}= 1_{I_{n_{k-1}}}.
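The "moving blocks" structure of Equation 11.2 can be made concrete in code: for a fixed \omega the sequence X_n(\omega) takes the value 1 once per block and 0 otherwise, so both values occur infinitely often, while P(X_n=1) shrinks to 0. A sketch; the enumeration helper below is our own reconstruction of Equation 11.2, using exact rational arithmetic:

```python
from fractions import Fraction

def interval(n):
    """Endpoints (a, b) with X_n = 1_[a,b) in the enumeration of Equation 11.2 (n >= 1)."""
    m, start = 2, 1
    while start + m <= n:                     # advance to the block of intervals of length 1/m
        start += m
        m += 1
    k = n - start
    return Fraction(k, m), Fraction(k + 1, m)

omega = Fraction(1, 3)                        # a fixed sample point
values = []
for n in range(1, 201):
    a, b = interval(n)
    values.append(1 if a <= omega < b else 0)

# X_n(omega) is 1 once per block and 0 otherwise: both values occur again and again,
# while P(X_n = 1) = 1/m -> 0 along the sequence.
assert 1 in values[100:] and 0 in values[100:]
```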
The relation between almost sure convergence and convergence in probability is contained in the following theorem.
The third part, while looking somewhat convoluted, is very useful to streamline subsequent proofs. It relies on the following simple fact: if a sequence x_n is such that every subsequence has a subsubsequence which converges to x, then x_n converges to x.
Theorem 11.3 (Almost sure convergence versus convergence in probability)
If X_n converges almost surely to X then X_n converges in probability to X.
If X_n converges in probability to X then there exists a subsequence X_{n_k} which converges to X almost surely.
If every subsequence of X_n has a further subsubsequence which converges to X almost surely, then X_n converges to X in probability.
Proof.
If X_n converges to X almost surely then i_\epsilon(|X_n-X|) converges to 0 almost surely. By the bounded convergence theorem this implies E[i_\epsilon(|X_n-X|)]=P(|X_n-X|\ge \epsilon) converges to 0.
If X_n converges to X in probability then let us pick \epsilon_k =\frac{1}{k} \searrow 0. Since P(|X_n(\omega) - X(\omega)|> \epsilon_k) \to 0 as n \to \infty we can find n_k such that P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \frac{1}{2^k}. Therefore
\sum_{k=1}^\infty P( |X_{n_k}(\omega) - X(\omega)|> \epsilon_k) \le \sum_{k} \frac{1}{2^k} < \infty \,.
By part 2. of Theorem 11.2 this shows that X_{n_k} converges almost surely to X.
Assume that every subsequence of X_n has a subsubsequence which converges to X almost surely. Fix \epsilon>0 and consider the numerical sequence p_n(\epsilon)=P(|X_n-X| \ge \epsilon). This sequence is bounded so, by the Bolzano-Weierstrass theorem, we may let p_{n_k} be a convergent subsequence with \lim_{k} p_{n_k} =p. By assumption X_{n_k} has a subsubsequence X_{n_{k_j}} which converges to X almost surely. This implies, by part 1., that p_{n_{k_j}} converges to 0 and thus p=0. So every convergent subsequence of the bounded sequence p_n converges to 0, which implies that p_n converges to 0 and thus X_n converges to X in probability. \quad \square
Based on this we obtain the following continuity theorem
Theorem 11.4 (Continuity theorem for convergence in probability)
If X_n converges to X almost surely and f is a continuous function then f(X_n) converges to f(X) almost surely.
If X_n converges to X in probability and f is a continuous function then f(X_n) converges to f(X) in probability.
Proof. Part 1. is just the definition of continuity.
For part 2. suppose X_n converges to X in probability. Then, by Theorem 11.3, any subsequence has a further subsequence X_{n_k} which converges to X almost surely, which implies, by part 1., that f(X_{n_k}) converges to f(X) almost surely.
Now we apply part 3. of Theorem 11.3 to the sequence Y_n= f(X_n): by the previous paragraph every subsequence of f(X_n) has a subsubsequence which converges to f(X) a.s., and thus f(X_n) converges to f(X) in probability. \quad \square
Using a similar argument we show that convergence in probability is preserved under arithmetic operations.
Theorem 11.5 Suppose X_n converges to X in probability and Y_n converges to Y in probability. Then X_n+Y_n converges to X+Y in probability, X_n-Y_n converges to X-Y in probability, X_n Y_n converges to XY in probability and X_n/Y_n converges to X/Y in probability (assuming that Y_n and Y are almost surely non-zero).
Proof. All the proofs are the same so let us do the sum. Given any subsequence, pick a further subsequence along which X_n converges to X almost surely, and then a subsubsequence along which Y_n also converges to Y almost surely. Along that subsubsequence X_n+Y_n converges to X+Y almost surely. So every subsequence of X_n+Y_n has a subsubsequence which converges to X+Y almost surely and thus, by part 3. of Theorem 11.3, X_n+Y_n converges to X+Y in probability. \quad \square
We finish this section by showing that we can use convergence in probability to turn the space of RVs into a complete metric space. We will use the following metric
d(X,Y) = E[ \min\{ |X-Y|, 1\} ]
The choice is not unique, often one will find instead d(X,Y)=E[ \frac{|X-Y|}{1 + |X-Y|}].
Theorem 11.6 (Convergence in probability and metric)
X_n converges to X in probability if and only if \lim_{n \to \infty}d(X_n,X) =0.
The space of measurable RVs,
L^0(\Omega, \mathcal{A}, P)=\{ X :(\Omega, \mathcal{A}, P) \to (\mathbb{R},\mathcal{B}) \text{ measurable} \}
with the metric d is a complete metric space for the metric d(X,Y) (if we identify RV which are a.s. equal).
Equivalently, for any Cauchy sequence \{X_n\} for convergence in probability there exists a random variable Y such that X_n converges to Y in probability.
Proof. It is easy to check that d(X,Y) is a distance.
We also have for \epsilon \in (0,1) and x \ge 0
\epsilon i_\epsilon(x) \le \min\{x,1\} \le \epsilon + i_\epsilon(x)
Replacing x by |X_n-X| and taking expectations gives \epsilon P(|X_n-X| \ge \epsilon) \le d(X_n,X) \le \epsilon + P(|X_n-X| \ge \epsilon), and letting n \to \infty this proves the equivalence claimed in the first part.
Finally assume that \{X_n\} is a Cauchy sequence for the metric d. Then \lim_{n,m} d(X_n,X_m) =0 or, as we have just seen, equivalently for any \epsilon >0 we can find N such that P(|X_n-X_m| \ge \epsilon) \le \epsilon for n,m\ge N. Choose now \epsilon_k = \frac{1}{2^k} and find corresponding N_k \le N_{k+1} \le \cdots. Setting Y_k=X_{N_k}, this implies that \sum_{k} P(|Y_k-Y_{k+1}|>\epsilon_k) < \infty and thus Y_k converges almost surely to a RV Y.
We show now that X_n converges to Y in probability. Since |X_n-Y| \le |X_n - X_{N_k}| + |Y_k -Y| we have
i_\epsilon(|X_n-Y|) \le i_{\epsilon/2}(|X_n-X_{N_k}|) + i_{\epsilon/2}(|Y_k-Y|) \,.
Taking expectations gives
P(|X_n-Y| > \epsilon) \le P(|X_n-X_{N_k}| > \epsilon/2) + P(|Y_k-Y|> \epsilon/2).
The first term goes to 0 as n goes to \infty since the sequence X_n is Cauchy and the second term goes to 0 as k goes to \infty since we have almost sure convergence. Therefore X_n converges to Y in probability. \quad \square
11.4 Convergence in L^p.
Convergence in L^p simply uses the norm \|X\|_p. Most of the time we use p=1 or p=2.
Definition 11.3 (Convergence in L^p) A sequence of RVs \{X_n\} converges in L^p to a RV X if
\lim_{n \to \infty} E[|X_n-X|^p] =0
or equivalently \lim_{n \to \infty}\|X_n-X\|_p=0.
Remarks:
The limit of a sequence in L^p is unique since \|X-Y\|_p\le \|X-X_n\|_p + \|Y-X_n\|_p by the Minkowski inequality.
Note that if X_n converges to X in L^1 then we have convergence of the first moments. Indeed we have
|E[X_n]-E[X]| \le E[|X_n-X|] \quad \text{and} \quad |E[|X_n|] - E[|X|]| \le E[|X_n-X|]
(the second follows from the reverse triangle inequality ||x|-|y||\le |x-y|). Therefore E[X_n]\to E[X] and E[|X_n|]\to E[|X|].
If X_n converges in L^1 then X_n does not need to converge almost surely. See the sequence Equation 11.2 which also converges to 0 in L^1.
If X_n converges in L^p then X_n converges in probability as well. We have P(|X-X_n| \ge \epsilon) \le E[|X_n-X|^p]/{\epsilon^p} by Markov inequality. In particular, by Theorem 11.3 if X_n converges to X in L^p then there exists a subsequence X_{n_k} which converges almost surely to X.
Conversely convergence in probability does not imply convergence in L^1. Modify the sequence in Equation 11.2 to make it Y_1=X_1, Y_2=2X_2, Y_3=2X_3, Y_4=3X_4, \cdots. This sequence still converges in probability to 0, but the scaling ensures that E[Y_n]=1 for all n, so Y_n does not converge to 0 in L^1. Note also that for any n there are infinitely many m \ge n such that E[|Y_n-Y_m|]=2, so the sequence is not even Cauchy in L^1.
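The sequence of Equation 11.2 is not reproduced here, so as a stand-in (which, unlike that sequence, even converges almost surely) consider the classical example Y_n = n\,1_{\{U \le 1/n\}} with U uniform on [0,1]: Y_n \to 0 in probability while E[Y_n]=1 for every n. A quick numerical sketch (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=200_000)  # one uniform draw per sample point omega

for n in [10, 100, 1000]:
    Y_n = n * (U <= 1 / n)  # Y_n = n * indicator(U <= 1/n)
    # P(Y_n > 0) = 1/n -> 0 (convergence in probability),
    # yet E[Y_n] = 1 for every n (no convergence in L^1)
    print(n, (Y_n > 0).mean(), Y_n.mean())
```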
We prove now a partial converse which looks a bit like the dominated convergence theorem.
Theorem 11.7 (Convergence in L^p versus convergence in probability)
If X_n converges to X in L^p then X_n converges to X in probability.
If X_n converges to X in probability and |X_n|\le Y for some Y \in L^p then X_n converges to X in L^p.
Proof. We already discussed 1. (Markov inequality). For the converse, since X_n converges in probability to X there exists a subsequence X_{n_k} which converges almost surely to X. Since |X_{n_k}|\le Y we see that |X|\le Y and thus X \in L^p.
The sequence a_n =E[|X-X_n|^p] is bounded since, by Minkowski, a_n^{1/p}=\|X-X_n\|_p \le \|X\|_p +\|X_n\|_p \le 2\|Y\|_p. Let a_{n_k} be any convergent subsequence. Since X_{n_k} converges to X in probability there exists a subsubsequence X_{n_{k_j}} which converges to X a.s. Then |X_{n_{k_j}}-X|^p converges almost surely to 0 and |X_{n_{k_j}}-X|^p \le 2^p |Y|^p which is integrable. So by the DCT, a_{n_{k_j}} converges to 0. Hence every convergent subsequence of the bounded sequence a_n has limit 0, and this implies that a_n converges to 0. \quad \square.
11.5 Exercises
Exercise 11.1 Show (by a counterexample) that if f is not continuous, convergence of X_n in probability to X does not imply convergence of f(X_n) to f(X) in probability.
Exercise 11.2 Show that if X_n is any sequence of random variables taking values in \mathbb{R} then we can find c_n such that \frac{X_n}{c_n} \to 0 almost surely. Hint: Use the Borel-Cantelli Lemma.
Exercise 11.3 Let X_1, X_2, … be independent Bernoulli random variables with P(X_n = 1) = p_n and P(X_n = 0) = 1−p_n.
Show that X_n converges to 0 in probability if and only if \lim_n p_n=0.
Show that X_n converges to 0 a.s. if and only if \sum_n p_n < ∞.
Hint: Use the Borel Cantelli Lemmas
12 The law of large numbers
The law of large numbers is a foundational (and also very intuitive) concept for probability theory. Suppose we are interested in finding the probability P(A) for some event A associated with a random experiment. To do this we repeat the experiment n times; for sufficiently large n, an approximate value for P(A) is the proportion of experiments whose outcome belongs to A:
P(A) \approx \frac{ \text{ number of times the outcome belongs to } A }{n} \quad \text{ if $n$ is large enough}
This is the basis for the frequentist approach to probability, probability of events are obtained by repeating the random experiment.
12.1 Strong law of large numbers
Consider a probability space (\Omega, \mathcal{A},P) on which real-valued random variables X_1, X_2, \cdots are defined. We define then the sum
S_n = X_1 + \cdots + X_n
The law of large numbers refers to the convergence of the average \frac{S_n}{n} to a limit. If the convergence is in probability we speak of a weak law of large numbers; if the convergence is almost sure we speak of a strong law of large numbers.
Example: Normal Consider X_1, X_2, \cdots, X_n independent normal RV with mean 0 and variance \sigma^2. Then using the moment generating function we have
E\left[e^{t\frac{S_n}{n}}\right] = E\left[e^{\frac{t}{n}X_1} \cdots e^{\frac{t}{n}X_n} \right] = E\left[e^{\frac{t}{n}X_1}\right] \cdots E\left[e^{\frac{t}{n}X_n} \right] = \left(e^{\frac{\sigma^2}{2} \frac{t^2}{n^2}} \right)^n = e^{\frac{\sigma^2}{2} \frac{t^2}{n}}
By Theorem 10.8 we conclude that \frac{S_n}{n} is a normal random variable with variance \frac{\sigma^2}{n}.
By Chebyshev we conclude that P\left( \left|\frac{S_n}{n}\right| \ge \epsilon\right) \le \frac{\sigma^2}{n \epsilon^2} and so \frac{S_n}{n} converges to 0 in probability. This is not enough to show almost sure convergence since \sum_{n}\frac{\sigma^2}{n \epsilon^2} =\infty.
By Chernoff however we have (see the example after Theorem 7.7) P\left( \left|\frac{S_n}{n}\right| \ge \epsilon \right) \le e^{-n \epsilon^2/2\sigma^2} and since \sum_{n}e^{-n \epsilon^2/2\sigma^2} < \infty the Borel-Cantelli Lemma (see Theorem 11.2) implies that \frac{S_n}{n} converges to 0 a.s.
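The concentration of the average can be checked numerically; the sketch below (sample sizes and seed are our arbitrary choices) verifies that the spread of \frac{S_n}{n} over many independent runs matches \frac{\sigma}{\sqrt{n}}:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, runs = 1.0, 10_000, 500
# 500 independent realizations of S_n/n for N(0, sigma^2) summands
averages = rng.normal(0.0, sigma, size=(runs, n)).mean(axis=1)
print(averages.std())          # close to sigma / sqrt(n) = 0.01
print(np.abs(averages).max())  # every realization of S_n/n is already tiny
```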
Example: Cauchy Consider X_1, X_2, \cdots, X_n independent Cauchy RV with parameter \beta. Using the characteristic function (recall the characteristic function of a Cauchy RV is e^{-\beta|t|}) we find that
E\left[e^{it\frac{S_n}{n}}\right]=E\left[e^{\frac{it}{n}X_1} \cdots e^{\frac{it}{n}X_n} \right]
= E\left[e^{\frac{it}{n}X_1}\right] \cdots E\left[e^{\frac{it}{n}X_n} \right] = \left( e^{-\beta\left|\frac{t}{n}\right|} \right)^n = e^{-\beta|t|}
that is \frac{S_n}{n} is a Cauchy RV with the same parameter as the X_i. Neither almost sure convergence nor convergence in probability to a constant is possible here; convergence in distribution will be the useful notion.
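The contrast with the normal case is easy to see in a simulation (the seed and sample size are our arbitrary choices): the running average of Cauchy samples never settles down, consistent with \frac{S_n}{n} being Cauchy for every n.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_cauchy(size=100_000)  # Cauchy with beta = 1
running_avg = np.cumsum(X) / np.arange(1, X.size + 1)
# the running average keeps making large jumps and does not converge
print(running_avg[[99, 999, 9_999, 99_999]])
```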
We start with a (not optimal) version of the LLN.
Theorem 12.1 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i] and variance \sigma^2=\rm{Var}(X_i)< \infty. Then we have
\lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n}
= \mu \quad
\left\{
\begin{array}{l}
\text{almost surely}\\ \text{in probability} \\ \text{ in } L^2
\end{array}
\right.
Proof. We show first that \frac{S_n}{n} converges to \mu in L^2. To simplify the computation we write Y_i=X_i-\mu such that E[Y_i]=0 and E[Y_i^2]={\rm Var}(Y_i)={\rm Var}(X_i)=\sigma^2. We have then \frac{S_n}{n}- \mu = \frac{1}{n}(Y_1+ \cdots + Y_n) and by independence E[Y_i Y_j]=0 for i \not= j. Using all this
E\left[\left( \frac{S_n}{n}- \mu\right)^2\right] = E\left[ \left( \frac{1}{n} \sum_{j=1}^n Y_j\right)^2\right] = \frac{1}{n^2}E\left[ \sum_{i,j=1}^n Y_i Y_j\right] = \frac{1}{n^2}E\left[ \sum_{i=1}^n Y_i^2 \right] + \underbrace{\frac{1}{n^2}E\left[ \sum_{i \not= j} Y_i Y_j \right]}_{=0} = \frac{\sigma^2}{n}
From this we conclude that \frac{S_n}{n} converges to \mu in L^2 and therefore also in probability.
The hard part is to show that the convergence is also almost sure. First we consider an explicit subsequence on which it converges almost surely. We pick the sequence n_k = k^2. Then we have
P\left(\left|\frac{S_{k^2}}{k^2}- \mu\right| \ge \epsilon\right) \le \frac{1}{\epsilon^2} E\left[\left( \frac{S_{k^2}}{k^2}- \mu\right)^2\right] =\frac{\sigma^2}{k^2 \epsilon^2}
and since \sum_{k} \frac{\sigma^2}{k^2 \epsilon^2} < \infty, by the Borel-Cantelli Lemma \frac{S_{k^2}}{k^2} converges to \mu almost surely.
For the next step, without loss of generality we assume that the RVs X_i are non-negative. If not, write X_i=X_{i+} - X_{i-} as the difference of positive and negative parts; the result for general X_i then follows from the result for non-negative X_i.
Given n pick k=k(n) such that k^2 \le n \le (k+1)^2. Since all the terms are positive we have
\frac{X_1 + \cdots + X_{k^2}}{k^2} \frac{k^2}{n} \le \frac{X_1 + \cdots + X_n}{n} \le \frac{(k+1)^2}{n} \frac{X_1 + \cdots + X_{(k+1)^2}}{(k+1)^2}
Now as n \to \infty, k=k(n) \to \infty and since \lim_{k \to \infty} \frac{(k+1)^2}{k^2} =1
1 \le \frac{n}{k^2} \le \frac{(k+1)^2}{k^2} \implies \lim_{n}\frac{n}{k^2} =1 \,.
Similarly \lim_{n\to \infty } \frac{(k+1)^2}{n}=1. Since \frac{S_{k^2}}{k^2} converges to \mu almost surely, both the lower and upper bounds converge to \mu, and this shows that \frac{S_n}{n} converges to \mu almost surely. \quad \square.
A stronger version exists: only the existence of a finite mean is needed. A proof will be given later with the aid of martingales.
Theorem 12.2 (The strong law of large numbers) Suppose X_1, X_2, \cdots are independent and identically distributed random variables defined on the probability space (\Omega, \mathcal{A},P) and with mean \mu=E[X_i]. Then we have
\lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n}
= \mu \quad
\left\{
\begin{array}{l}
\text{almost surely}\\ \text{in probability}
\end{array}
\right.
We show next that the case \mu=+\infty can also be treated.
Theorem 12.3 Suppose X_1,X_2, \cdots are independent identically distributed non-negative random variables with E[X_i]=+\infty. Then we have
\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = +\infty \quad \text{almost surely}
Proof. We use a truncation argument combined with the monotone convergence theorem. Pick R>0 and set Y_n = \min\{ X_n, R\} which is bounded and thus has finite variance. So by Theorem 12.2 we have, almost surely, for \mu_R =E[ \min\{ X_1, R\}]
\lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R
Since X_n \ge Y_n we have for any R
\liminf_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} \ge \lim_{n \to \infty} \frac{Y_1 + \cdots + Y_n}{n} = \mu_R
But as R \nearrow \infty, \min\{ X_1, R\} \nearrow X_1 and thus by the monotone convergence theorem \mu_R =E[ \min\{ X_1, R\}]\nearrow E[X_1]=\infty. This concludes the proof. \quad \square.
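As a numerical illustration of this theorem (the distribution here is our own choice): if U is uniform on (0,1) then X=1/U has density 1/x^2 on (1,\infty), so E[X]=\infty, and the running averages drift off to infinity, roughly like \log n.

```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.uniform(size=1_000_000)
X = 1.0 / U  # P(X > x) = 1/x for x >= 1, so E[X] = infinity
for m in (100, 10_000, 1_000_000):
    print(m, X[:m].mean())  # the averages keep growing with m
```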
12.2 Sample variance
Example: convergence of the sample variance Suppose X_1, X_2, \cdots are independent and identically distributed RVs; then the sample variance is given by
V_n =\sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2
and as proved in Exercise 8.5, E[V_n]= (n-1)\sigma^2.
We claim that \frac{V_n}{n} \to \sigma^2 almost surely. Indeed we have
\frac{V_n}{n}= \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \frac{S_n}{n}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} \left( X_i^2 - 2 X_i \frac{S_n}{n} + \frac{S_n^2}{n^2} \right) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{S_n}{n}\right)^2
By the law of large numbers, Theorem 12.2, applied to the RVs X_1, X_2, \cdots and to the RVs X_1^2, X^2_2, \cdots (note that E[X_1^2]=\sigma^2 + \mu^2), and by continuity, Theorem 11.4, we have
\frac{V_n}{n} \to \sigma^2 + \mu^2 - \mu^2 =\sigma^2\,.
```python
import random
import matplotlib.pyplot as plt

# Parameter for the exponential distribution
lambda_parameter = 0.5  # adjust this to your desired rate parameter

# Running statistics
sample_mean = 0.0
sample_variance = 0.0
sample_size = 0

# Number of samples to collect
num_samples = 100000

# Lists to store data for plotting
sample_means = []
sample_variances = []

for _ in range(num_samples):
    # Generate a random sample from the exponential distribution
    sample = random.expovariate(lambda_parameter)
    sample_size += 1
    # Update the sample mean incrementally (Welford's algorithm)
    delta = sample - sample_mean
    sample_mean += delta / sample_size
    # Update the sample variance incrementally, using the deviation from the
    # old mean (delta) and from the updated mean
    if sample_size > 1:
        sample_variance = ((sample_size - 2) * sample_variance
                           + delta * (sample - sample_mean)) / (sample_size - 1)
    # Append sample mean and variance to the lists for plotting
    sample_means.append(sample_mean)
    sample_variances.append(sample_variance)

print(f"Last 20 sample means    = {sample_means[-20:]}")
print(f"Last 20 sample variances = {sample_variances[-20:]}")

# Plot the sample mean and sample variance as a function of sample size
plt.figure(figsize=(8, 12))
plt.subplot(211)
plt.plot(range(1000, num_samples + 1), sample_means[999:])
plt.title("Sample Mean vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Mean")
plt.subplot(212)
plt.plot(range(1000, num_samples + 1), sample_variances[999:])
plt.title("Sample Variance vs Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Sample Variance")
plt.tight_layout()
plt.show()
```
12.3 LLN proof of the Weierstrass approximation theorem
A classical result in analysis is that a continuous function f:[a,b] \to \mathbb{R} can be uniformly approximated by polynomials: for any \epsilon >0 there exists a polynomial p(x) such that \sup_{x \in [a,b]}|p(x) - f(x)|\le \epsilon. Many proofs of this result exist and we give one here based on the Law of Large Numbers, although the statement has nothing to do with probability. Without loss of generality, by rescaling, we can take [a,b]=[0,1], and we use the polynomials naturally associated to binomial random variables, the Bernstein polynomials.
Theorem 12.4 Let f:[0,1]\to \mathbb{R} be a continuous function. Let f_n(x) be the Bernstein polynomial of degree n associated to f, given by
f_n(x) = \sum_{k=0}^n {n \choose k} x^k(1-x)^{n-k} f(k/n)\,.
Then we have
\lim_{n\to \infty} \sup_{x \in [0,1]} |f(x)-f_n(x)| =0 \,.
Proof. If X_i are IID Bernoulli random variables with success probability p then S_n=X_1 + \cdots + X_n is a binomial RV and
E\left[f\left(\frac{S_n}{n}\right)\right] = \sum_{k=0}^n {n \choose k} p^k(1-p)^{n-k} f(k/n) = f_n(p)
Since \frac{S_n}{n} converges to p a.s. and in probability, f\left(\frac{S_n}{n}\right) converges to f(p) and thus, taking expectations, f_n(p) converges to f(p). We still need to work harder, though, to establish uniform convergence in p.
The variance of a binomial is n p(1-p) \le \frac{n}{4}, so the variance of \frac{S_n}{n} is bounded uniformly in p\in [0,1], and thus by Chebyshev
P\left( \left|\frac{S_n}{n} - p\right| \ge \delta \right) \le \frac{p(1-p)}{n \delta^2} \le \frac{1}{4 n \delta^2}
Since f is continuous on a compact interval, f is bounded with \sup_{x}|f(x)|=M < \infty and f is also uniformly continuous on [0,1]. Given \epsilon >0 pick \delta such that |x-y| < \delta \implies |f(x)-f(y)| < \epsilon. We have, for n large enough,
\begin{aligned}
& |f_n(p)-f(p)|=\left| E\left[f\left(\frac{S_n}{n}\right)\right] - f(p) \right| \le E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right|\right] \\
&= E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| \ge \delta} \right]
+ E\left[\left|f\left(\frac{S_n}{n}\right) -f(p)\right| 1_{\left|\frac{S_n}{n} - p\right| < \delta} \right] \le 2M \frac{1}{4 n \delta^2} + \epsilon \le 2\epsilon
\end{aligned}
This proves uniform convergence. \quad \square.
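The theorem is easy to check numerically; in the sketch below (the test function f(x)=|x-1/2| and the evaluation grid are our own choices) the binomial pmf supplies the Bernstein weights, and the sup-norm error decreases as the degree grows.

```python
import numpy as np
from scipy.stats import binom

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f at points x in [0,1]."""
    x = np.asarray(x, dtype=float)
    k = np.arange(n + 1)
    # weights C(n,k) x^k (1-x)^(n-k): one row of binomial probabilities per point
    weights = binom.pmf(k, n, x[:, None])
    return weights @ f(k / n)

f = lambda x: np.abs(x - 0.5)  # continuous but not differentiable at 1/2
xs = np.linspace(0.0, 1.0, 201)
for n in (10, 50, 200):
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print(f"n={n:4d}  sup error = {err:.4f}")
```

The error decays slowly (like 1/\sqrt{n} at the kink), which is typical of Bernstein approximation: uniform convergence is guaranteed, but the rate is modest.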
12.4 Empirical density and Glivenko-Cantelli
Example: sample CDF and empirical measure Suppose X_1, X_2, \cdots are independent and identically distributed RVs with common CDF F(t). Then for every fixed t
F_n(t)=\frac{\# \{ i \in \{1,\cdots,n\}: X_i \le t\}}{n} \to F(t) \quad \text{ almost surely}\,. \tag{12.1}
F_n(t)=F_n(t,\omega) is called the empirical CDF. Note that F_n(t,\omega) is the (random) CDF of the discrete random variable with distribution
\frac{1}{n} \sum_{i=1}^n\delta_{X_i(\omega)}
which is called the empirical distribution. The convergence of F_n(t) to F(t) is just the law of large numbers applied to Y_n=1_{X_n \le t} whose mean is E[Y_n]=P(X_n \le t)=F(t).
A strengthening of the law of large numbers is that the empirical CDF F_n(t) converges to F(t) uniformly in t.
Theorem 12.5 (Glivenko-Cantelli Theorem) For a RV X with CDF F(t) we have
\sup_{t \in \mathbb{R}}|F_n(t)-F(t)| \text{ converges to } 0 \quad \text{almost surely}.
Proof. We only prove the case where F(t) is continuous, but the proof can be generalized to a general CDF by considering the jumps more carefully. The proof relies on the fact that F is increasing, which precludes oscillations and controls the convergence.
First we show that we can pick a set of probability 1 such that the convergence occurs for all t \in \mathbb{R} on that set. Since a countable union of sets of probability 0 has probability 0, we can pick a set \Omega_0 of probability 1 such that F_n(t,\omega) converges to F(t) for all rational t and all \omega \in \Omega_0. For x \in \mathbb{R} and rational s,t with s \le x \le t we have
F_n(s,\omega) \le F_n(x,\omega) \le F_n(t,\omega)
and therefore
F(s)\le \liminf_{n}F_n(x,\omega) \le \limsup_{n}F_n(x,\omega) \le F(t)
Since F(t) \searrow F(x) as t \searrow x and F(s) \nearrow F(x) as s \nearrow x we conclude that F_n(x,\omega) \to F(x) for all x \in \mathbb{R} and all \omega \in \Omega_0.
We show next that, for any \omega \in \Omega_0, the convergence is uniform in t. Since F is increasing and bounded, given \epsilon>0 we can find -\infty = t_0 < t_1 < \cdots < t_m=+\infty such that F(t_j)-F(t_{j-1}) \le \frac{\epsilon}{2}. Using that F_n and F are increasing we have for t \in [t_{j-1},t_j]
\begin{aligned}
F_n(t)-F(t) &\le F_n(t_j) - F(t_{j-1}) \le F_n(t_j)- F(t_j) + \frac{\epsilon}{2}\\
F_n(t)-F(t) &\ge F_n(t_{j-1}) - F(t_{j}) \ge F_n(t_{j-1})- F(t_{j-1}) - \frac{\epsilon}{2}\\
\end{aligned}
We can now pick N_j=N_j(\omega) such that |F_n(t_j)- F(t_j)|\le \frac{\epsilon}{2} if n \ge N_j and therefore if n\ge N=\max_j N_j we have, for all t \in \mathbb{R},
|F_n(t,\omega)-F(t)| \le \epsilon
for all t and all n \ge N(\omega). This shows that \sup_{t\in \mathbb{R}}|F_n(t)-F(t)| converges almost surely to 0. \quad \square.
Illustration of the Glivenko-Cantelli Theorem (made with ChatGPT)
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Generate random data from a Beta distribution
np.random.seed(42)
true_distribution = beta.rvs(2, 5, size=1000)

# Empirical distribution function
def empirical_distribution(data, x):
    return np.sum(data <= x) / len(data)

# Compute the empirical distribution function for different sample sizes
sample_sizes = [10, 100, 1000]
x_values = np.linspace(0, 1, 1000)

plt.figure(figsize=(12, 8))
for n in sample_sizes:
    # Generate a random sample of size n
    sample = np.random.choice(true_distribution, size=n, replace=True)
    # Calculate empirical distribution function values
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    # Plot the empirical distribution function
    plt.plot(x_values, edf_values, label=f'n={n}')

# Plot the true distribution function
true_cdf_values = beta.cdf(x_values, 2, 5)
plt.plot(x_values, true_cdf_values, linestyle='--', color='black',
         label='True Distribution')

# Plot the difference between ECDF and true CDF
for n in sample_sizes:
    sample = np.random.choice(true_distribution, size=n, replace=True)
    edf_values = [empirical_distribution(sample, x) for x in x_values]
    difference = np.abs(np.array(edf_values) - true_cdf_values)
    plt.plot(x_values, difference, linestyle='dotted', label=f'n={n} (Difference)')

plt.title('Glivenko-Cantelli Theorem with Beta Distribution')
plt.xlabel('x')
plt.ylabel('Empirical Distribution Function / Difference')
plt.legend()
plt.show()
```
12.5 The Monte-Carlo method
The (simple) Monte-Carlo method is a probabilistic algorithm using sums of independent random variables and the law of large numbers to estimate a (deterministic) quantity \mu \in \mathbb{R} (or \mathbb{R}^d).
The basic idea is to express \mu as the expectation of some random variable \mu = E [ h(X) ] and then use the law of large numbers to build up an estimator for \mu.
Find a random variable h(X) such that \mu= E[h(X)].
I_n = \frac{1}{n} \sum_{k=1}^n h(X_k), where X_k are IID copies of X, is an unbiased estimator for \mu, that is we have
For all n we have E[I_n] = \mu (unbiased).
\lim_{n \to \infty} I_n = \mu almost surely and in probability.
An interesting feature is that there are, in general, many ways to choose the random variable h(X).
Conversely, in many problems the random variable h(X) is given but the expectation is too difficult to compute, so we rely on the LLN to compute \mu.
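The recipe above can be sketched in a few lines; here is a minimal example (the integrand is our own choice): to estimate \mu=\int_0^1 e^{-x^2}\,dx \approx 0.7468, write \mu = E[h(U)] with h(x)=e^{-x^2} and U uniform on [0,1].

```python
import numpy as np

rng = np.random.default_rng(4)
h = lambda x: np.exp(-x**2)  # mu = E[h(U)] = integral_0^1 exp(-x^2) dx
U = rng.uniform(size=100_000)
# running estimator I_n = (h(U_1) + ... + h(U_n)) / n
I_n = np.cumsum(h(U)) / np.arange(1, U.size + 1)
print(I_n[[99, 9_999, 99_999]])  # settles near 0.7468
```

Each I_n is unbiased, and by the LLN the estimates converge to \mu as the sample size grows.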
12.6 Computing \pi with Buffon’s needles
This seems to be the first example of a random sampling experiment used to solve a mathematical problem, due to Le Comte de Buffon (see his biography on Wikipedia).
A needle of length l is thrown at random on a floor made of floorboards of width d, and we assume l \le d. We want to compute the probability that the needle intersects the line between two floorboards.
Denote by X the distance from the center of the needle to the nearest line (this is uniformly distributed on [0,\frac{d}{2}]) and by \Theta the acute angle between the needle and a horizontal line (this is uniformly distributed on [0,\frac{\pi}{2}]).
For the needle to intersect a line we must have X \le \frac{l}{2} \sin(\Theta) and thus
P\left(X \le \frac{l}{2} \sin(\Theta)\right) = \int_0^{\frac{\pi}{2}} \int_0^{\frac{l}{2}\sin(\theta)} \frac{2}{d} dx \, \frac{2}{\pi} d\theta = \frac{2l}{d\pi}
So in order to estimate \pi, throw n needles at random on the floor and use
\pi \approx \frac{2ln}{d} \frac{1}{ \# \textrm { of needles intersecting two floor boards}}
No random number generator needed….
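Simulating the throws on a computer does of course reintroduce a random number generator, but it shows the estimator at work (the values of l, d and the sample size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
l, d, n = 1.0, 2.0, 1_000_000              # needle length l <= board width d
X = rng.uniform(0, d / 2, size=n)          # distance from center to nearest line
Theta = rng.uniform(0, np.pi / 2, size=n)  # acute angle with the lines
hits = np.count_nonzero(X <= (l / 2) * np.sin(Theta))
print(2 * l * n / (d * hits))              # estimate of pi
```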
12.7 Computing \pi with random numbers
Enclose a disk of radius 1 in a square of side length 2 and consider the following Bernoulli random variable X.
Generate 2 independent random numbers V_1,V_2 uniformly distributed on [-1,1].
If V_1^2 + V_2^2 \le 1, set X=1, otherwise set X=0.
X is Bernoulli with p=\frac{\textrm{area of the disk}}{\textrm{ area of the square}}=\frac{\pi}{4}
So \frac{4}{n}\sum_{k=1}^n X_k \to \pi with probability 1.
```python
import numpy as np

N = 1000000
dart_positions = 2 * np.random.rand(N, 2) - 1  # generates numbers in [-1, 1]

Ncircle = [0]  # start the count with 0 to make our loop-logic easier
for x, y in dart_positions:
    if x**2 + y**2 < 1:
        Ncircle.append(Ncircle[-1] + 1)  # another dart has fallen in the circle
    else:
        Ncircle.append(Ncircle[-1])      # the dart fell outside of the circle

running_estimate = []
for n_total, n_circle in enumerate(Ncircle[1:]):  # skip the initial 0-count
    # n_total starts at 0, so we need to add 1
    running_estimate.append(4 * n_circle / (n_total + 1))

np.transpose(running_estimate[-20:])
```