Part 3: Measures on Product Spaces and Conditional Expectation

Probability Theory: Math 605, Fall 2024

Luc Rey-Bellet

University of Massachusetts Amherst

2024-10-08

1 Independence and product measures

In this section we study the concept of independence in a general form: independence of random variables and independence of \sigma-algebras. This leads to the concept of product measures and the classical Fubini theorem. We illustrate these ideas with vector-valued random variables and some simulation algorithms. Dependent random variables will be considered in the next section.

1.1 Independence

  • Two events A,B \in \mathcal{A} are independent if P(A|B)=P(A), or equivalently P(A \cap B) =P(A) P(B), and this generalizes to arbitrary collections of events (see ?@def-independence).

  • For two random variables X and Y to be independent, any information or knowledge derived from the RV Y should not influence the RV X. All the information encoded in a RV X taking values in (E,\mathcal{E}) is the \sigma-algebra generated by X, that is \sigma(X) = X^{-1}(\mathcal{E}) = \left\{ X^{-1}(B), B \in \mathcal{E}\right\}. This motivates the following.

Definition 1.1 (Independence) Let (\Omega, \mathcal{A},P) be a probability space.

  1. Independence of \sigma-algebras: Two sub-\sigma-algebras \mathcal{A}_1 \subset \mathcal{A} and \mathcal{A}_2 \subset \mathcal{A} are independent if P(A_1 \cap A_2) = P(A_1) P(A_2) \quad \text{ for all } A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2\,. A collection (not necessarily countable) of \sigma-algebras \left\{\mathcal{A}_j\right\}_{j \in J} is independent if, for any finite subset I \subset J,
    P\left( \bigcap_{i\in I} A_i\right) = \prod_{i \in I} P(A_i) \quad \text{ for all } A_i \in \mathcal{A}_i \,.

  2. Independence of random variables: The collection of random variables X_j: (\Omega, \mathcal{A},P) \to (E_j, \mathcal{E}_j) for j \in J are independent if the collection of \sigma-algebras \sigma(X_j) are independent.

We consider from now on only two random variables X and Y, but all of this generalizes easily to arbitrary finite collections. Our next theorem makes the definition of independence a bit easier to check.

Theorem 1.1 (Characterization of independence) Two random variables X (taking values in (E,\mathcal{E})) and Y (taking values in (F,\mathcal{F})) are independent if and only if any of the following equivalent conditions holds.

  1. P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{E} and for all B \in \mathcal{F}.

  2. P(X \in A, Y \in B)=P(X \in A) P(Y \in B) for all A \in \mathcal{C} and for all B \in \mathcal{D} where \mathcal{C} and \mathcal{D} are p-systems generating \mathcal{E} and \mathcal{F}.

  3. f(X) and g(Y) are independent for any measurable f and g.

  4. E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and measurable (or all non-negative) f,g.

  5. If E=F=\mathbb{R} (or \mathbb{R}^d), E[f(X)g(Y)]=E[f(X)]E[g(Y)] for all bounded and continuous functions f,g.

Proof.

\bullet Item 1. is merely a restatement of the definition and clearly item 1. \implies item 2.

  • To see that Item 2. \implies item 1. we use the monotone class theorem. Fix B \in \mathcal{D}; then the collection \left\{ A \in \mathcal{E}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\} contains the p-system \mathcal{C} and is a d-system (it contains \Omega, is closed under complement, and is closed under increasing limits; check this yourself please). Therefore by the monotone class theorem it contains \mathcal{E}. Analogously, fixing now A \in \mathcal{E}, the set
    \left\{ B \in \mathcal{F}\,:\, P(X \in A, Y \in B)=P(X\in A) P(Y \in B)\right\} contains \mathcal{F}.

  • To see that item 3. \implies item 1. take f(X)=1_A(X) and g(Y)=1_{B}(Y). If these two random variables are independent, this simply means that the events X\in A and Y\in B are independent. Conversely we note that V=f(X) is measurable with respect to \sigma(X): since V^{-1}(B) = X^{-1} (f^{-1}(B)) \in \sigma(X) this shows that \sigma(f(X))\subset \sigma(X). Likewise \sigma(g(Y)) \subset \sigma(Y).
    Since \sigma(X) and \sigma(Y) are independent so are \sigma(f(X)) and \sigma(g(Y)).

  • To see that item 4. \implies item 1. take f=1_A and g=1_B. To show that item 1. \implies item 4. note that item 1. can be rewritten as E[1_A(X) 1_B(Y)] =E[1_A(X)] E[1_B(Y)]. By linearity of the expectation, item 4. then holds for all simple functions f and g. If f and g are non-negative then we choose sequences of simple functions such that f_n \nearrow f and g_n \nearrow g. We have then f_n g_n \nearrow fg and using the monotone convergence theorem twice we have \begin{aligned} E[f(X)g(Y)] & = E[ \lim_{n} f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) g_n(Y)] = \lim_{n} E[ f_n(X) ] E[g_n(Y)] \\ &= \lim_{n} E[ f_n(X) ] \lim_{n} E[g_n(Y)] =E[ f(X) ] E[g(Y)] \end{aligned}

If f and g are bounded and measurable then we write f=f_+-f_- and g=g_+-g_-. Then f_\pm and g_\pm are bounded and measurable and thus the products of f_\pm(X) and g_\pm(Y) are also integrable. We have \begin{aligned} E[f(X)g(Y)] & = E[(f_+(X)-f_-(X))(g_+(Y)-g_-(Y))] \\ &= E[ f_+(X)g_+(Y)]+ E[f_-(X)g_-(Y)] - E[f_+(X)g_-(Y)]- E[f_-(X)g_+(Y)] \\ &= E[ f_+(X)]E[g_+(Y)]+ E[f_-(X)]E[g_-(Y)] - E[f_+(X)]E[g_-(Y)]- E[f_-(X)]E[g_+(Y)] \\ &= E[ f_+(X)-f_-(X)] E[ g_+(Y)-g_-(Y)] \end{aligned}

\bullet Clearly item 1 \implies item 4 \implies item 5. For the converse we show that item 5. \implies item 2. Given an interval (a,b) consider an increasing sequence of piecewise linear continuous functions f_n \nearrow 1_{(a,b)}: f_n(t) = \left\{ \begin{array}{cl} 0 & t \le a \text{ or } t \ge b \\ 1 & a+\frac{1}{n} \le t \le b-\frac{1}{n}\\ \text{linear} & \text{otherwise } \end{array} \right. The intervals of the form (a,b) form a p-system which generates the Borel \sigma-algebra \mathcal{B}. By using a monotone convergence argument as in the proof of item 1. \implies item 4., with f_n \nearrow 1_{(a,b)} and g_n \nearrow 1_{(c,d)}, the identity E[f_n(X) g_n(Y)]= E[f_n(X)] E[g_n(Y)] gives in the limit P(X \in (a,b), Y \in (c,d)) = P(X \in (a,b)) P(Y \in (c,d)) and so item 2. holds.
For separable metric spaces, the so-called Urysohn lemma can be used to prove the same result. \quad \square

1.2 Independence and product measures

  • While we have expressed so far independence of X and Y as a property on the probability space (\Omega,\mathcal{A},P), we can also view it as a property of the distributions P^X and P^Y.

  • Example: Suppose X and Y are discrete random variables. If they are independent we have P(X =i, Y=j) = P(X=i) P(Y=j) and thus P^{(X,Y)}(i,j) = P^X(i) P^Y(j). That is, the distribution of the random variable Z=(X,Y) factorizes into the product of P^X and P^Y.

  • Product spaces. In order to build up more examples we need the so-called Fubini theorem. Given two measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}) we consider the product space (E \times F, \mathcal{E} \otimes \mathcal{F}) where \mathcal{E} \otimes \mathcal{F} is the \sigma-algebra generated by the rectangles A \times B (see ?@exr-45).

  • Measurable functions on product spaces. As we have seen in ?@exr-45, for any measurable function f: E \times F \to \mathbb{R} the sections y \mapsto f(x,y) (for any fixed x) and x \mapsto f(x,y) (for any fixed y) are measurable.

Theorem 1.2 (Tonelli-Fubini Theorem) Suppose P is a probability on (E,\mathcal{E}) and Q is a probability on (F,\mathcal{F}).

  1. The set function R defined on rectangles by R(A \times B) = P(A) Q(B) extends to a unique probability measure on \mathcal{E}\otimes \mathcal{F}. This measure is denoted by P \otimes Q and is called the product measure of P and Q.

  2. Suppose f is measurable with respect to \mathcal{E}\otimes \mathcal{F} and either non-negative or integrable with respect to P \otimes Q. Then the functions x \mapsto \int f(x,y) dQ(y) \quad \text{ and } \quad y \mapsto \int f(x,y) dP(x) are measurable (and, in the integrable case, integrable with respect to P and Q respectively) and we have \int_{E \times F} f(x,y) d (P\otimes Q) (x,y) = \int_E \left(\int_F f(x,y) dQ(y)\right) dP(x) = \int_F \left(\int_E f(x,y) dP(x)\right) dQ(y)

Proof. For item 1. we need to extend R to arbitrary elements of \mathcal{E}\otimes\mathcal{F}. For C \in\mathcal{E}\otimes\mathcal{F} consider the slice of C along x given by C(x) = \left\{ y \in F\,: (x,y) \in C \right\} \,. If C = A \times B, then C(x)=B for x\in A and C(x)=\emptyset for x \notin A, and we have then R(C) = P(A)Q(B) = \int_E Q(C(x)) dP(x) Now we define \mathcal{H}=\left\{ C \in \mathcal{E}\otimes\mathcal{F}\,:\, x \to Q(C(x)) \text{ is measurable} \right\}; it is not difficult to check that \mathcal{H} is a \sigma-algebra and \mathcal{H} contains all rectangles A \times B, and therefore \mathcal{H} = \mathcal{E} \otimes \mathcal{F}.
We now define, for any C \in \mathcal{H} = \mathcal{E} \otimes \mathcal{F}, R(C) = \int_E Q(C(x)) dP(x) and we check this is a probability measure. Clearly R(E \times F) = P(E)Q(F)=1. Let C_n \in \mathcal{E} \otimes \mathcal{F} be pairwise disjoint and C = \cup_{n=1}^\infty C_n. Then the slices C_n(x) are pairwise disjoint and by countable additivity of Q, Q(C(x))=\sum_{n=1}^\infty Q(C_n(x)).

Applying the MCT to the functions g_n(x) = \sum_{k=1}^n Q(C_k(x)) we find that \sum_{n=1}^\infty R(C_n) = \sum_{n=1}^\infty \int_E Q(C_n(x)) dP(x) = \int_E \sum_{n=1}^\infty Q(C_n(x)) dP(x) = \int_E Q(C(x)) dP(x) =R(C), and this shows that R is a probability measure. Uniqueness of R follows from the monotone class theorem.

For item 2. note that in item 1. we have proved it in the case where f(x,y)= 1_C(x,y). By linearity the result then holds for simple functions. If f is non-negative and measurable then pick an increasing sequence of simple functions such that f_n \nearrow f. Then by the MCT \int f(x,y) d P\otimes Q(x,y) = \lim_{n} \int f_n(x,y) d P\otimes Q(x,y) = \lim_{n} \int_E (\int_F f_n(x,y) d Q(y)) dP(x) But the function x \to \int_F f_n(x,y) d Q(y) is increasing in n and, by the MCT again, \begin{aligned} \int f(x,y) d P\otimes Q(x,y) & = \int_E \lim_{n} (\int_F f_n(x,y) d Q(y)) dP(x) = \int_E (\int_F \lim_{n} f_n(x,y) d Q(y)) dP(x) \\ &= \int_E (\int_F f(x,y) d Q(y)) dP(x) \end{aligned} Similarly one shows that \int f(x,y) d P\otimes Q(x,y) = \int_F (\int_E f(x,y) d P(x)) dQ(y). The result for integrable f follows by decomposing into positive and negative parts. \quad \square
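
As a quick sanity check of the two iterated integrals, here is a small Monte Carlo sketch (Python with NumPy; the choices P = Exp(1), Q = Uniform(0,1), the test function f(x,y)=e^{-xy} and the sample sizes are arbitrary illustrations, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x, y: np.exp(-x * y)             # a bounded measurable test function

n = 200_000
x = rng.exponential(1.0, n)                 # samples from P = Exp(1) on E = [0, infinity)
y = rng.uniform(0.0, 1.0, n)                # samples from Q = Uniform(0, 1) on F = [0, 1)
joint = f(x, y).mean()                      # estimate of the integral of f against P (x) Q

# iterated integral: for each x, estimate the inner integral over Q with a fresh batch of y's
inner = np.array([f(xi, rng.uniform(0.0, 1.0, 200)).mean() for xi in x[:5_000]])
iterated = inner.mean()                     # estimate of int_E ( int_F f dQ ) dP

print(joint, iterated)                      # the two estimates agree up to Monte Carlo error
```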

Applying this to random variables we get

Corollary 1.1 Suppose Z=(X,Y): (\Omega, \mathcal{A}, P) \to (E \times F, \mathcal{E}\otimes \mathcal{F}). Then the random variables X and Y are independent if and only if the distribution P^{(X,Y)} of the pair (X,Y) is equal to P^X \otimes P^Y.

Proof. The random variables X and Y are independent if and only if P((X,Y) \in A \times B)=P(X \in A) P(Y \in B)\,.
This is equivalent to saying that P^{(X,Y)}(A \times B) = P^X(A) P^Y(B). By the uniqueness in the Fubini theorem we have P^{(X,Y)}=P^X \otimes P^Y. \quad \square

1.3 Constructing a probability space for independent random variables

We can construct a probability model for X_1, \cdots, X_n which are independent (real-valued) random variables with given distribution P^{X_1}, \cdots ,P^{X_n}.

We know how to construct the probability space for each random variable separately: for example, X_i : (\Omega_i,\mathcal{B}_i,P_i) \to (\mathbb{R}, \mathcal{B}) where \Omega_i=[0,1], P_i is the Lebesgue measure on [0,1], and X_i = Q_i where Q_i is a quantile function for the distribution P^{X_i}.

We now take \Omega = \prod_{i} \Omega_i = [0,1]^n \,, \quad \mathcal{B}= \otimes_{i=1}^n \mathcal{B}_i, \quad P= \otimes_{i=1}^n P_i and define the map X: \Omega \to \mathbb{R}^n by X=(X_1,\cdots, X_n).

The Fubini-Tonelli theorem shows that P^{X}=P \circ X^{-1} is the distribution of X=(X_1,\cdots, X_n) on \mathbb{R}^n with the product \sigma-algebra and that the random variables X_1, \cdots, X_n are independent.
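
A small simulation sketch of this construction (Python with NumPy/SciPy; the choice of an Exp(1) first coordinate, a standard normal second coordinate, and the sample size are illustrative assumptions): each coordinate of a uniform point of [0,1]^2 is pushed through a quantile function, and independence can be probed through a product of test functions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
u = rng.uniform(size=(n, 2))          # points of Omega = [0,1]^2 under P = Lebesgue (x) Lebesgue
x1 = -np.log(1.0 - u[:, 0])           # quantile function of Exp(1) applied to the first coordinate
x2 = norm.ppf(u[:, 1])                # quantile function of N(0,1) applied to the second coordinate

# E[f(X1) g(X2)] should factor into E[f(X1)] E[g(X2)] for independent coordinates
lhs = np.mean(np.exp(-x1) * np.cos(x2))
rhs = np.mean(np.exp(-x1)) * np.mean(np.cos(x2))
print(lhs, rhs)                       # agree up to Monte Carlo error
```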

Fubini-Tonelli for countably many RV:

We extend this result to countably many independent random variables: this is important in practice where we need such models, for example flipping a coin as many times as needed!

This can be seen as an extension of the Fubini theorem and is also a special case of the so-called Kolmogorov extension theorem which is used to construct general probability measures on infinite product spaces. No proof is given here.

Infinite product \sigma-algebras
Given \sigma-algebras \mathcal{A}_j on \Omega_j we set \Omega = \prod_{j=1}^\infty \Omega_j and define rectangles, for n_1 < n_2 < \cdots < n_k and k finite but arbitrary, A_{n_1} \times A_{n_2} \times \cdots \times A_{n_k} = \{ \omega=(\omega_1,\omega_2,\omega_3, \cdots) \in \Omega \;: \omega_{n_j} \in A_{n_j}\} where A_{n_j} \in \mathcal{A}_{n_j}. The product \sigma-algebra \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j is the \sigma-algebra generated by all the rectangles.

Theorem 1.3 Given probability spaces (\Omega_j,\mathcal{A}_j,P_j) and with \Omega = \prod_{j=1}^\infty \Omega_j and \mathcal{A} = \bigotimes_{j=1}^\infty \mathcal{A}_j there exists a unique probability P on (\Omega, \mathcal{A}) such that P(A_{n_1} \times \cdots \times A_{n_k}) = \prod_{j=1}^k P_{n_j}(A_{n_j}) for all A_{n_j}\in \mathcal{A}_{n_j}, all n_1< \cdots < n_k and all k.

If we have a RV X_n: \Omega_n \to \mathbb{R} then we define its extension to \Omega by \tilde{X_n}(\omega) = X_n(\omega_n) and the distribution of \tilde{X_n} is the same as the distribution of X_n because \tilde{X_n}^{-1}(B_n) = \Omega_1 \times \cdots \times \Omega_{n-1} \times X_n^{-1}(B_n) \times \Omega_{n+1} \times \cdots and thus P(\tilde{X_n}\in B_n) =P_n (X_n \in B_n)\,. A similar computation shows that \tilde{X}_n and \tilde{X}_m are independent for n \not= m or more generally any finite collection of X_j's are independent.

1.4 Kolmogorov zero-one law

We consider (X_n)_{n=1}^\infty to be RVs defined on some probability space \Omega. We may think of n as “time” and we consider the following \sigma-algebras
\begin{aligned} \mathcal{B}_n & = \sigma(X_n) & \text{the $\sigma$-algebra generated by } X_n \\ \mathcal{C}_n & = \sigma\left( \cup_{p \ge n} \mathcal{B}_p\right) & \text{the $\sigma$-algebra describing the "future" after time } n \\ \mathcal{C}_\infty & = \cap_{n=1}^\infty \mathcal{C}_n & \text{the "tail" $\sigma$-algebra or $\sigma$-algebra "at infinity"} \end{aligned}

Theorem 1.4 (Zero-one law) Suppose X_n is a sequence of independent random variables and let \mathcal{C_\infty} be the corresponding tail \sigma-algebra. Then we have C \in \mathcal{C}_\infty\implies P(C)=0 \text{ or } P(C)=1

Proof. The \sigma-algebras \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_{n+1}\} are independent and therefore \{\mathcal{B}_1, \cdots, \mathcal{B}_n, \mathcal{C}_\infty\} are independent for every n since \mathcal{C}_\infty \subset \mathcal{C}_{n+1}. Therefore (by a monotone class argument) \mathcal{C}_1=\sigma\left( \cup_{n \ge 1} \mathcal{B}_n\right) and \mathcal{C}_\infty are independent. So for A \in \mathcal{C}_1 and B \in \mathcal{C}_\infty we have P(A \cap B)=P(A) P(B). This holds also for A=B \in \mathcal{C}_\infty since \mathcal{C}_\infty \subset \mathcal{C}_1. Therefore we have P(A)=P(A)^2 which is possible only if P(A)=0 or 1 \quad \square.

Examples Given independent random variables X_1, X_2, \cdots we define S_n = X_1 + X_2 +\cdots + X_n.

  • The event \{ \omega\,:\, \lim_{n} X_n(\omega) \text{ exists }\} belongs to every \mathcal{C}_n and thus belongs to \mathcal{C}_\infty. Therefore X_n either converges a.s. or diverges a.s.

  • A random variable which is measurable with respect to \mathcal{C}_\infty must be constant almost surely. Therefore \limsup_n X_n \,, \quad \liminf_n X_n\,, \quad \limsup_n \frac{1}{n} S_n\,, \quad \liminf_n \frac{1}{n} S_n are all constant almost surely.

  • The event \limsup_{n} \{ X_n \in B\} (also called \{ X_n \in B \text{ infinitely often }\}) is in \mathcal{C}_\infty.

  • The event \limsup_{n} \{ S_n \in B\} is not in \mathcal{C}_\infty.

1.5 Exercises

Exercise 1.1  

  • Suppose X\ge 0 is a non-negative random variable and p > 0. Show that E[X^p]= \int_0^\infty p t^{p-1} (1- F(t)) dt = \int_0^\infty p t^{p-1} P(X>t) dt In particular E[X]=\int_0^\infty P(X>t) dt.
    Hint: Use Fubini on the product measure P times Lebesgue measure on [0,\infty).

  • Deduce from this that if X takes non-negative integer values we have E[X]= \sum_{n \ge 0} P( X>n)\,, \quad E[X^2]= 2 \sum_{n >0} n P( X>n) + E[X] \,.


Exercise 1.2 Find three random variables X, Y, and Z taking values in \{-1,1\} which are pairwise independent but are not independent.

Exercise 1.3  

  • A random variable X is a Bernoulli RV if it takes only the values 0 or 1, and then E[X]=P(X=1) (alternatively you can think of X(\omega)=1_A(\omega) for some measurable set A). Show that Y=1-X is also Bernoulli and that the product of two Bernoulli random variables is again a Bernoulli RV (no independence required).

  • Suppose X_1, X_2, \cdots, X_n are Bernoulli random variables on some probability space (\Omega,\mathcal{A},P) (they are not assumed to be independent) and let Y_k = 1-X_k. Show that P(X_1=0, X_2=0, \cdots, X_n=0) =E\left[ \prod_{i=1}^n Y_i\right] and P(X_1=0, \cdots, X_k =0,X_{k+1}=1, \cdots, X_n=1) =E \left[ \prod_{i=1}^k Y_i \prod_{j=k+1}^n X_j\right]

  • Show that the Bernoulli random variables X_1, \cdots, X_n are independent if and only if E[\prod_{j\in J} X_j]= \prod_{j\in J}E[X_j] for all subsets J of \{1,\cdots, n\}.

Exercise 1.4 Consider the probability space ([0,1), \mathcal{B}, P) where \mathcal B is the Borel \sigma-algebra and P is the uniform distribution on [0,1). Expand each number \omega in [0,1) in dyadic expansion (or bits) \omega = \sum_{n=1}^\infty \frac{d_n(\omega)}{2^n} =0. d_1(\omega)d_2(\omega)d_3(\omega) \cdots \quad \quad \textrm{ with } \quad d_n(\omega) \in \{0,1\}

  • For certain numbers the dyadic expansion is not unique. For example we have \frac{1}{2} = 0.1000\cdots = \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots = 0.0111\cdots Show that P almost surely a number \omega has a unique dyadic expansion.

  • Prove that each d_n(\omega) is a random variable and that they are identically distributed (i.e. each d_n has the same distribution).

  • Prove that the d_n, n=1,2,3, \cdots are a collection of independent and identically distributed random variables.

Remark: this problem shows that you can think of the Lebesgue measure on [0,1) as the infinite product measure of independent Bernoulli trials.

Exercise 1.5 Suppose X and Y are RVs with finite variance (or equivalently X,Y\in L^2). The covariance of X and Y is defined by {\rm Cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY]- E[X]E[Y]

  1. Show that {\rm Cov}(X,Y) is well defined and bounded in absolute value by \sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}.

  2. The correlation coefficient \rho(X,Y)=\frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}(X)}\sqrt{{\rm Var}(Y)}} measures the correlation between X and Y. Given a number \alpha \in [-1,1] find two random variables X and Y such that \rho(X,Y)=\alpha.

  3. Show that {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y) + 2{\rm Cov}(X,Y).

  4. Show that if X and Y are independent then {\rm Cov}(X,Y)=0 and so {\rm Var}(X+Y)= {\rm Var}(X) + {\rm Var}(Y).

  5. The converse statement of 4. is in general not true, that is {\rm Cov}(X,Y)=0 does not imply that X and Y are independent. Hint: For example take X to be standard normal, Z discrete with P(Z=\pm 1)=\frac12 and independent of X (Z is sometimes called a Rademacher RV), and Y=ZX.

  6. Suppose X_1, \cdots, X_n are independent RVs with E[X_j]=\mu and {\rm Var}(X_j)=\sigma^2< \infty for all j. The sample mean \overline{X} and sample variance S^2 are given by \overline{X}=\frac{1}{n}\sum_{j=1}^n X_j \quad \quad S^2= \frac{1}{n-1}\sum_{j=1}^n(X_j -\overline{X})^2 Show that E[\overline{X}]=\mu, {\rm Var}(\overline{X})=\frac{\sigma^2}{n} and E[S^2]=\sigma^2.

2 Probability kernels and measures on product spaces

2.1 Probability kernels

How do we build “general” measures on some product space E \times F (e.g. \mathbb{R}^2 or \mathbb{R}^n)?

Definition 2.1 Given two measurable spaces (E,\mathcal{E}) and (F,\mathcal{F}), a probability kernel K(x,B) from (E,\mathcal{E}) into (F,\mathcal{F}) (also often called a Markov kernel) is a map K: E \times \mathcal{F} \to \mathbb{R} such that

  1. For any B\in \mathcal{F} the map x \to K(x,B) is a measurable map.

  2. For any x\in E the map B \to K(x,B) is a probability measure on (F,\mathcal{F}).

Examples

  • If E and F are countable sets then a kernel is built from a transition matrix K(i,j), i \in E and j \in F such that K(i,j) \ge 0 \text{ and } \sum_{j \in F} K(i,j) =1 We have then K(i,B)= \sum_{j \in B} K(i,j).

  • Suppose (F,\mathcal{F})=(\mathbb{R},\mathcal{B}); then a kernel is built from a “conditional density” k(x,y), where x \in E and y\in \mathbb{R}, such that k is measurable and \int_{\mathbb{R}} k(x,y) dy=1 for all x (see examples below).

The following theorem is a generalization of Fubini-Tonelli

Theorem 2.1 Let P be a probability measure on (E,\mathcal{E}) and K(x,B) a probability kernel from (E,\mathcal{E}) into (F, \mathcal{F}). Then a probability measure R on (E \times F, \mathcal{E}\otimes \mathcal{F}) is defined by \int f(x,y) dR(x,y) = \int_E \left(\int_F f(x,y) K(x,dy)\right) dP(x) for any measurable non-negative f. In particular we have for any A \in \mathcal{E} and B \in \mathcal{F} R( A \times B) = \int 1_A(x) 1_B(y) dR(x,y) = \int 1_A(x) K(x,B) dP(x) = \int_A K(x,B) dP(x)\,. We write R(dx,dy)=P(dx)K(x,dy).

Proof. The proof is very similar to Fubini-Tonelli theorem and is thus omitted.
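
Although the proof is omitted, the construction has a direct algorithmic reading: to sample from R(dx,dy)=P(dx)K(x,dy), first draw x from P and then draw y from the probability measure K(x,\cdot). A minimal sketch in the discrete setting (Python with NumPy; the 2\times 3 transition matrix and the marginal P below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.3, 0.7])                       # a probability on E = {0, 1}
K = np.array([[0.5, 0.25, 0.25],               # K(0, .) on F = {0, 1, 2}
              [0.1, 0.6, 0.3]])                # K(1, .)

n = 200_000
x = rng.choice(2, size=n, p=P)                 # first draw x ~ P
cum = K.cumsum(axis=1)                         # inverse-cdf sampling of y given x, i.e. y ~ K(x, .)
y = (rng.uniform(size=n)[:, None] > cum[x]).sum(axis=1)

# empirical R(A x B) versus the formula R(A x B) = sum over x in A of K(x, B) P(x)
A, B = [0], [1, 2]
emp = np.mean(np.isin(x, A) & np.isin(y, B))
exact = sum(P[i] * K[i, B].sum() for i in A)
print(emp, exact)                              # agree up to Monte Carlo error
```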

  • This theorem is intimately related to the concepts of conditional probability and conditional expectations which we will study later.

  • Roughly speaking, on nice probability spaces (e.g. separable metric spaces), every probability measure on the product space E \times F can be constructed in the way described in Theorem 2.1 (this is a deep result).

Definition 2.2 (marginals of a probability measure) Given a probability measure R on the product space (E \times F, \mathcal{E}\otimes \mathcal{F}), the marginals of R on E and F are defined to be the measures given by \begin{aligned} P(A) &= R(A \times F)\, \quad A \in \mathcal{E} \\ Q(B) &= R(E \times B)\, \quad B \in \mathcal{F} \end{aligned} Alternatively we can think of the marginals as the image measures P = R \circ \pi_E^{-1}, \quad \quad Q = R \circ \pi_F^{-1} where \pi_E and \pi_F are the projection maps \pi_E(x,y) = x and \pi_F(x,y) = y

  • If R=P \otimes Q is a product measure then the marginals of R are P and Q. This corresponds to the constant kernel K(x,B)=Q(B).

  • If R(dx,dy)= P(dx) K(x,dy) then we have R(A \times F) = \int_A K(x,F) dP(x) = P(A) \quad \quad R(E \times B) = \int_E K(x,B) dP(x) so its marginals are P and Q, where Q is given by Q(B) = \int_E K(x,B) dP(x).

2.2 Lebesgue measure on \mathbb{R}^n and densities

In the previous chapter we constructed the Lebesgue probability measure P_0 on [0,1] as the unique measure such that P_0((a,b])=b-a. Using this and other uniformly distributed random variables we define the Lebesgue measure on \mathbb{R} and on \mathbb{R}^n.

Definition 2.3 The Lebesgue measure on \mathbb{R} is a set function m such that

  1. m(\emptyset)=0.

  2. m\left(\bigcup_{i=1}^\infty A_i \right)=\sum_{i=1}^\infty m(A_i) for pairwise disjoint Borel sets A_i.

  3. m((a,b])=b-a for any a < b.

The Lebesgue measure on \mathbb{R} is not a probability measure since m(\mathbb{R})=+\infty; it is an example of an infinite measure. We can easily construct it using the uniform distributions P_{n,n+1} on (n,n+1], namely we set
m(A) = \sum_{n=-\infty}^\infty P_{n,n+1}(A) The uniqueness of the measures P_{n,n+1} implies that the measure m is unique as well.

By doing a Fubini-Tonelli theorem type argument one can construct the Lebesgue measure on \mathbb{R}^n.

Definition 2.4 If we equip \mathbb{R}^n with the product \sigma-algebra \mathcal{B}\otimes \cdots \otimes \mathcal{B}, the Lebesgue measure m_n on \mathbb{R}^n is the product of n Lebesgue measures on \mathbb{R}. We have m_n\left( \prod_{i=1}^n [a_i,b_i]\right)= \prod_{i=1}^n (b_i-a_i)


Notation: we often use the notation dx or dx_1 \cdots dx_n for integration with respect to m_n.


Definition 2.5 A probability measure on (\mathbb{R}^n,\mathcal{B}_n) (where \mathcal{B}_n=\mathcal{B} \otimes \cdots \otimes \mathcal{B}) has a density f if f is a nonnegative Borel measurable function and P(A)\,=\, \int_A f(x) dx = \int 1_A(x) f(x) dx = \int f(x_1, \cdots, x_n) 1_A(x_1, \cdots, x_n) dx_1 \cdots dx_n

Theorem 2.2 A non-negative Borel measurable function f(x) is the density of a Borel probability measure if and only if \int f(x) dx =1, and it then determines the probability P. Conversely the probability measure determines its density (if it exists!) up to a set of Lebesgue measure 0.

Proof. Given f\ge 0 with \int f(x) dx=1 it is easy to check that P(A) = \int 1_A f(x) dx defines a probability measure (same proof as in ?@exr-63).
Conversely assume that f and f' are two densities for the measure P; then for any measurable set A we have P(A)=\int_A f(x) dx = \int_A f'(x) dx. Consider now the set A_n =\left\{ x : f'(x)\ge f(x)+ \frac1n\right\} Then we have P(A_n)= \int_{A_n} f'(x) dx \ge \int_{A_n} \left(f(x) + \frac1n\right) dx = P(A_n) + \frac1n m(A_n) and therefore m(A_n)=0, and since A_n increases to \{f < f'\} we have shown that m(\{f < f'\})=0. By symmetry m(\{f' < f\})=0 and thus f=f' almost everywhere. \quad \square

Theorem 2.3 Suppose the random variable (X,Y) has a probability distribution R(dx,dy) with density f(x,y). Then

  1. Both X and Y have densities given respectively by f_X(x)=\int f(x,y) dy\,,\quad \quad f_Y(y)=\int f(x,y) dx \,.

  2. X and Y are independent if and only if f(x,y) = f_X(x)f_Y(y)\,.

  3. If we set k(x,y)= \frac{f(x,y)}{f_X(x)} \text{ for } x \text{ such that } f_X(x) \not= 0, then this defines a kernel K(x,B)=\int_B k(x,y)dy and the probability distribution of (X,Y) is given by R(dx,dy)=f_X(x)k(x,y)dxdy.

Remark It does not matter how K(x,B) is defined for those x where f_X(x)=0 and so we have left it undefined. There are in general many kernel densities which give the same probability, allowing for changes on sets of zero probability.

Proof.

  1. For A \in \mathcal{B} we have P(X\in A) = P(X \in A, Y \in \mathbb{R}) = \int_A \left(\int_{\mathbb{R}} f(x,y) dy\right) dx = \int_A f_X(x) dx and since this holds for all A, f_X(x) is a density for the distribution of X.

  2. If f(x,y)=f_X(x) f_Y(y) then \begin{aligned} P(X \in A, Y \in B)&= \int 1_{A \times B}(x,y) f_X(x) f_Y(y) dx dy = \int 1_{A}(x) f_X(x) dx \int 1_{B}(y) f_Y(y) dy \\ &=P(X\in A)P(Y\in B) \end{aligned} Conversely assume that X and Y are independent. Consider the collection of sets \mathcal{H} =\left\{ C \in \mathcal{B}\otimes \mathcal{B} \,:\, \int_C f(x,y) dx dy = \int_C f_X(x) f_Y(y) dx dy\right\} Independence and the Fubini theorem imply that any set C = A \times B belongs to \mathcal{H}. Since the rectangles form a p-system generating the \sigma-algebra, the monotone class theorem shows that \mathcal{H}=\mathcal{B}\otimes \mathcal{B}.

To prove item 3. we have \int k(x,y) dy = \int \frac{f(x,y)}{f_X(x)} dy = \frac{f_X(x)}{f_X(x)} =1 and thus k(x,\cdot) is a probability density. Furthermore the measure R(dx,dy)=f_X(x)k(x,y)dxdy satisfies \begin{aligned} R(A \times B) &= \int_A f_X(x) \left(\int_B k(x,y) dy \right) dx = \int_A f_X(x) \left(\int_B \frac{f(x,y)}{f_X(x)} dy \right) dx \\ & = \int_A \int_B f(x,y) dx dy = P( X \in A, Y \in B) \end{aligned} and this concludes the proof since the measure is uniquely determined by its values on rectangles (by a monotone class theorem argument).

2.3 Example: the Box-Muller algorithm

We derive here a method to generate two independent standard normal random variables from two independent random numbers. This is a different algorithm from the one using the quantile function of the normal RV, which is known only numerically.

Theorem 2.4 Suppose that U_1 and U_2 are two independent random numbers. Then \begin{aligned} X_1 = \sqrt{-2\ln(U_1)} \cos( 2 \pi U_2) \quad \quad \quad X_2 = \sqrt{-2\ln(U_1)} \sin( 2 \pi U_2) \end{aligned} are two independent normal random variables with mean 0 and variance 1.

Proof. We use the expectation rule ?@thm-expectationrule together with polar coordinates. For any nonnegative function h(x_1,x_2) we have, using polar coordinates x_1 =r \cos\theta and x_2=r \sin \theta and then the change of variable s=r^2/2, \begin{aligned} E[ h(X_1,X_2)] & = \int h(x_1,x_2) f(x_1,x_2) dx_1 dx_2 = \int_{\mathbb{R}^2} h(x_1,x_2) \frac{1}{2\pi} e^{-\frac{\|x\|^2}{2}} dx_1 dx_2 \\ & = \int_{[0,\infty)\times[0,2\pi]} h(r \cos\theta,r \sin\theta) \frac{1}{2\pi}d\theta\, r e^{-\frac{r^2}{2}}dr \\ & = \int_{[0,\infty)\times[0,2\pi]} h(\sqrt{2s} \cos\theta, \sqrt{2s} \sin\theta) \frac{1}{2\pi}d\theta\, e^{-s} ds \end{aligned} This computation shows that if S is exponential with parameter 1 and \Theta is uniform on [0,2\pi] and independent of S, then \sqrt{2 S}\cos(\Theta) and \sqrt{2 S}\sin(\Theta) are independent standard normals. But we can write S=-\ln(U_1) and \Theta=2\pi U_2. \quad \square
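
A direct implementation sketch of this algorithm (Python with NumPy; the sample size is arbitrary):

```python
import numpy as np

def box_muller(n, rng):
    """Turn 2n independent random numbers into 2n independent standard normals."""
    u1 = 1.0 - rng.uniform(size=n)        # shift [0,1) to (0,1] so that log(u1) is finite
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))        # R = sqrt(2 S) with S = -log(U1) ~ Exp(1)
    theta = 2.0 * np.pi * u2              # Theta uniform on [0, 2 pi]
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(0)
x1, x2 = box_muller(100_000, rng)
# means ~ 0, variances ~ 1, and E[X1 X2] ~ 0 (consistent with independence)
print(x1.mean(), x2.mean(), x1.var(), x2.var(), np.mean(x1 * x2))
```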

2.4 Exponential mixture of exponential is polynomial

Let us consider an exponential random variable Y whose parameter is itself an exponential random variable X with parameter \lambda>0. That is, X has density f(x)= \left\{ \begin{array}{cl} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{else} \end{array} \right. and consider the kernel K(x,dy)=k(x,y)dy with k(x,y) = \left\{ \begin{array}{cl} xe^{-xy} & x>0, y>0 \\ 0 & \text{ else } \end{array} \right. Then (X,Y) has density f(x,y)= \left\{ \begin{array}{cl} \lambda e^{-\lambda x} x e^{-xy} = \lambda x e^{-(\lambda +y)x} & x>0, y>0 \\ 0 & \text{ else } \end{array} \right. Then, using that the mean of an exponential RV is the reciprocal of its parameter, the density of Y is
f(y) = \int_0^\infty f(x,y) dx = \frac{\lambda}{\lambda +y} \int_0^\infty x (\lambda +y) e^{-(\lambda +y)x} dx = \frac{\lambda}{(\lambda+y)^2} which decays polynomially!
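
A quick simulation check of this computation (Python with NumPy; the value \lambda=2 and the sample size are arbitrary). Integrating the density gives the tail P(Y > t) = \lambda/(\lambda+t), which is what the code compares against.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 500_000
x = rng.exponential(1.0 / lam, n)        # X ~ Exp(lam); NumPy's scale parameter is the mean 1/lam
y = rng.exponential(1.0 / x)             # Y given X = x is Exp(x), i.e. the kernel K(x, dy)

for t in (0.5, 1.0, 5.0, 20.0):
    print(t, np.mean(y > t), lam / (lam + t))   # empirical tail vs. exact tail lam/(lam + t)
```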

2.5 Gamma and beta random variables

Recall that a gamma RV with parameters (\alpha,\beta) has density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x} for x \ge 0.

Consider now two independent gamma RVs X_1 and X_2 with parameters (\alpha_1, \beta) and (\alpha_2, \beta). We prove the following facts:

  1. Z=X_1 + X_2 is a gamma RV with parameters (\alpha_1+\alpha_2, \beta)

  2. U=\frac{X_1}{X_1+X_2} has a beta distribution with parameters \alpha_1 and \alpha_2, which has the density f(u) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} \quad 0 \le u \le 1

  3. X_1 + X_2 and \frac{X_1}{X_1+X_2} are independent RV

We use the expectation rule ?@thm-expectationrule and the change of variables z= x_1+x_2 and u= \frac{x_1}{x_1+x_2}, i.e. x_1= uz and x_2=(1-u)z. This maps [0,\infty) \times [0,\infty) onto [0,\infty) \times [0,1] and the Jacobian of this transformation is equal to z.

We have then for any nonnegative h \begin{aligned} E[h(Z,U)]&=E\left[h\left(X_1+X_2,\frac{X_1}{X_1+X_2}\right)\right]\\ &= \int_0^\infty \int_0^\infty h\left(x_1+x_2,\frac{x_1}{x_1+x_2}\right) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} x_1^{\alpha_1-1} x_2^{\alpha_2 -1} e^{- \beta(x_1 + x_2)} dx_1 dx_2 \\ & = \int_0^1 \int_0^\infty h(z,u) \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1) \Gamma(\alpha_2)} (uz)^{\alpha_1-1} ((1-u)z)^{\alpha_2-1} e^{-\beta z} z \, dz \, du \\ &= \int_0^1 \int_0^\infty h(z,u)\, \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1}\, \frac{\beta^{\alpha_1+\alpha_2}}{\Gamma(\alpha_1+ \alpha_2)} z^{\alpha_1 + \alpha_2 -1} e^{-\beta z} \, dz \, du \end{aligned} so the joint density of (Z,U) factorizes into a beta density in u times a gamma density in z, and this proves all three statements at once.

Remark This is a nice indirect way to compute the normalization constant for the density of the beta distribution, which is proportional to u^{\alpha_1-1}(1-u)^{\alpha_2-1}.
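
A simulation sketch of these three facts (Python with NumPy and SciPy; the parameters \alpha_1=2, \alpha_2=3, \beta=1.5 and the sample size are arbitrary). Note that NumPy and SciPy parametrize the gamma distribution by a scale equal to 1/\beta.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a1, a2, beta, n = 2.0, 3.0, 1.5, 200_000
x1 = rng.gamma(a1, 1.0 / beta, n)        # X1 ~ Gamma(a1, beta), scale = 1/beta
x2 = rng.gamma(a2, 1.0 / beta, n)        # X2 ~ Gamma(a2, beta), independent of X1
z, u = x1 + x2, x1 / (x1 + x2)

# Kolmogorov-Smirnov tests against the claimed distributions (p-values should not be tiny,
# since the null hypotheses are true here)
print(stats.kstest(z, "gamma", args=(a1 + a2, 0.0, 1.0 / beta)).pvalue)   # Z ~ Gamma(a1+a2, beta)
print(stats.kstest(u, "beta", args=(a1, a2)).pvalue)                      # U ~ Beta(a1, a2)
print(np.corrcoef(z, u)[0, 1])                                            # ~ 0, consistent with independence
```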

2.6 The beta-binomial model and a whiff of Bayesian statistics

A beta random variable U has mean E[U] = \int_0^1 u \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1-1} (1-u)^{\alpha_2-1} du = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1 +1) \Gamma(\alpha_2)}{\Gamma(\alpha_1 +\alpha_2 +1)} = \frac{\alpha_1}{\alpha_1+\alpha_2} and proceeding similarly one finds that {\rm Var}(U)= \frac{\alpha_1 \alpha_2}{(\alpha_1+\alpha_2)^2 (\alpha_1 + \alpha_2+1)}

There is a natural connection between a binomial random variable (discrete) and the beta random variable (continuous) (let us call it P for a change). The pdfs look strangely similar: {n \choose j} p^j (1-p)^{n-j} \quad \text{ versus } \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 -1} But p is a parameter for the first while it is the variable for the second!

Model:

  • Make n independent trials, each with a (random) probability of success P.

  • Take P to have a beta distribution with suitably chosen parameter \alpha_1,\alpha_2.

  • The mean \frac{\alpha_1}{\alpha_1+\alpha_2} is your average guess for the “true” probability p, and by adjusting the overall scale \alpha_1+\alpha_2 you can adjust the variance (uncertainty) associated with your guess.

This leads to considering a random variable (X,P) taking values in \{0,1,\cdots,n\} \times [0,1] with a “density” f(j,p) = \underbrace{{n \choose j} p^j (1-p)^{n-j}}_{=k(p,j) \text{ kernel }} \underbrace{\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p^{\alpha_1-1} (1-p)^{\alpha_2-1} }_{=f(p)} which is normalized: \sum_{j=0}^n \int_0^1 f(j,p) dp =1

The beta-binomial distribution with parameters (n,\alpha_1,\alpha_2) is the marginal distribution of X on \{0,1,\cdots,n\} given by \begin{aligned} f(j) = \int_0^1 f(j,p) dp & = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \int_0^1 p^{\alpha_1+j -1}(1-p)^{\alpha_2+n-j-1} dp \\ & = {n \choose j} \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\Gamma(\alpha_1+j) \Gamma(\alpha_2+n-j)}{\Gamma(\alpha_1+\alpha_2+n)} \end{aligned}

Bayesian statistics framework

  • We interpret the distribution f(p) with parameters \alpha_1 and \alpha_2 as the prior distribution which describes our beliefs before we do any experiment. For example \alpha_1=\alpha_2=1 corresponds to a uniform distribution on p (that is, we are totally agnostic, a fine choice if you know nothing).

  • The beta-binomial, which is the marginal f(j)=\int_0^1 f(j,p) dp, describes the distribution of the independent trials under this model. It is called the evidence.

  • The kernel k(p,j) is called the likelihood: it describes the distribution of the number of successes, j, given a certain probability of success, p. It is called the likelihood when we view it as a function of p and think of j as a parameter.

  • Now we can write the distribution using kernels in two ways: f(j,p)= k(p,j)f(p) = k(j,p) f(j) and the kernel k(j,p) is called the posterior distribution. It is interpreted as the distribution of p given that j successes have been observed.

  • We can rewrite this (this is just a version of Bayes theorem) as \text{posterior} = k(j,p) = \frac{k(p,j)f(p)}{f(j)} = \frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\,.

  • For the particular model at hand k(j,p) \propto p^{j} (1-p)^{n-j} p^{\alpha_1-1} (1-p)^{\alpha_2-1} \propto p^{\alpha_1+j-1} (1-p)^{\alpha_2+ n- j-1} and therefore k(j,p) is a beta distribution with parameters \alpha_1 +j and \alpha_2+n-j.

  • This is special: the prior and posterior distributions belong to the same family. We say that we have conjugate priors and this is a simple model to play with. In general we need Markov chain Monte Carlo to do the job.

  • Example: Amazon seller ratings. You want to buy a book online.

Vendor 1 has 151 positive ratings and 7 negative ratings (95.5%).
Vendor 2 has 946 positive ratings and 52 negative ratings (94.7%).
A uniform prior with \alpha_1=\alpha_2=1 gives two beta posteriors, with (\alpha_1,\alpha_2)=(152,8) and (\alpha_1,\alpha_2)=(947,53) respectively; a Monte Carlo comparison of the two posteriors is sketched below.
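
One way to use the two posteriors is to compute the posterior probability that vendor 1's success rate exceeds vendor 2's; a minimal Monte Carlo sketch (Python with NumPy; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
p1 = rng.beta(152, 8, n)      # posterior for vendor 1: Beta(1 + 151, 1 + 7)
p2 = rng.beta(947, 53, n)     # posterior for vendor 2: Beta(1 + 946, 1 + 52)
print(np.mean(p1 > p2))       # posterior probability that vendor 1 has the higher rate
```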

2.7 Exercises

Exercise 2.1  

  1. Suppose that X is a gamma RV with parameters \alpha and \beta and Z is an exponential RV with parameter 1, and suppose further that Z and X are independent. Find the CDF (and then the PDF) of Y=\frac{Z}{X}.

  2. Proceeding as in Section 2.4, consider an exponential RV Y whose parameter X has a gamma distribution with parameters \alpha and \beta. Find the marginal distribution of Y.

  3. Compare 1. and 2. and explain why they give the same result.

Exercise 2.2 (Poisson-Gamma model) Consider a Poisson RV X with a random parameter \Lambda which itself has a gamma distribution (for some parameters (\alpha,\beta).)

  1. What are the joint density and the kernel, f(j,\lambda)=k(\lambda,j) f(\lambda), for the pair (X,\Lambda)?

  2. What is the density f(j) of X? (the “evidence”)?

  3. What is the “posterior distribution” (that is what is the kernel k(j,\lambda) if we write f(j,\lambda)= k(j,\lambda)f(j))?

3 Conditional expectation

3.1 Conditioning on a random variable

We already know, for discrete random variables, how to define the conditional expectation with respect to a random variable Y, E[g(X,Y)|Y]. Recall that if g(X,Y) is non-negative or integrable we have seen that E[g(X,Y)|Y] is a random variable of the form h(Y), and for any nonnegative measurable u we have E[\, E[ g(X,Y)|Y]\, u(Y)\,] = E[\, g(X,Y)\, u(Y)\, ]

We now consider this in greater generality, to cover general RVs with arbitrary distributions. First we prove a simple result.

Theorem 3.1 A random variable Z is measurable with respect to \sigma(Y) if and only if Z=h(Y) for some measurable function h.

Proof. If Z=h(Y) then, since \sigma( h(Y)) \subset \sigma(Y), Z is measurable with respect to \sigma(Y).

Conversely let us consider the class of random variables \mathcal{M}=\left\{ Z=h(Y)\,;\, h \text{ measurable} \right\}.

Note first that \mathcal{M} contains all functions of the form 1_A with A \in \sigma(Y). Indeed A \in \sigma(Y) means that A=Y^{-1}(B) for some B \in \mathcal{E} and thus 1_A=1_B\circ Y \in \mathcal{M}. Since Z_1,Z_2 \in \mathcal{M} implies that a_1 Z_1 + a_2 Z_2 \in \mathcal{M}, \mathcal{M} contains all simple functions of the form \sum_{i=1}^N a_i 1_{A_i} = \sum_{i=1}^N a_i 1_{B_i}\circ Y with A_i \in \sigma(Y).

If Z_n \in \mathcal{M} are nonnegative with Z_n =f_n(Y) \nearrow Z, then taking f=\sup_{n}f_n we have Z=f(Y), so Z \in \mathcal{M}. By a monotone convergence argument (approximating a nonnegative \sigma(Y)-measurable Z by simple functions, and then decomposing a general Z into positive and negative parts) this shows that \mathcal{M} contains every \sigma(Y)-measurable random variable. \quad \square

Definition 3.1 (Conditional expectation) Suppose Z \ge 0 or Z \in L^1(P). The conditional expectation E[Z|Y] is a random variable which is measurable with respect to \sigma(Y) (i.e. it can be written as \psi(Y) for some measurable \psi) such that E[ E[Z|Y] U ] = E[Z U] for all bounded and \sigma(Y)-measurable random variables U.

We will prove later a more general theorem which implies the next result.

Theorem 3.2 For Z \ge 0 or Z \in L^1(P) the conditional expectation E[Z|Y] exists and is unique P almost surely.

Proof. The existence will be established in Section 3.2. As for uniqueness, let us assume that h_1(Y) and h_2(Y) both satisfy the requirement. Then for any bounded function of Y, U=g(Y), we have E[h_1(Y) U] = E[h_2(Y)U]. We take now U=1_{h_1(Y)\ge h_2(Y)} and thus obtain E[(h_1(Y) - h_2(Y)) 1_{h_1(Y)\ge h_2(Y)}] =0. Since the integrand is non-negative it must vanish almost surely, that is h_1(Y) \le h_2(Y) a.s.; exchanging the roles of h_1 and h_2 gives h_1(Y) = h_2(Y) a.s. \quad \square

3.2 Conditioning on a \sigma-algebra

4 The characteristic and moment generating function for a random variable

One of the oldest tricks in the mathematician's toolbox is to obtain properties of a mathematical object by performing a transformation on that object to map it into another space. In analysis (say for ODEs and PDEs) the Fourier transform and the Laplace transform play a very important role. Both play an equally important role in probability theory!

  • The Fourier transform of a probability measure leads to a proof of the central limit theorem!

  • The Laplace transform (via Chernoff bounds) leads to concentration inequalities and performance guarantees for Monte Carlo methods and statistical learning.

4.1 Fourier transform and characteristic function

Notation For vectors x,y \in \mathbb{R}^n we use the notation \langle x, y\rangle = \sum_{i=1}^n x_i y_i for the scalar product in \mathbb{R}^n

Definition 4.1 (Fourier transform and characteristic function)  

  • For a probability measure P on \mathbb{R}^n the Fourier transform of P is a function \widehat{P}(t): \mathbb{R}^n \to \mathbb{C} given by
    \widehat{P}(t) = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP(x)

  • For a random variable X taking values in \mathbb{R}^n the characteristic function of X is the Fourier transform of P^X (the distribution of X): we have \phi_X(t)= E\left[e^{i\langle t,X \rangle}\right] = \int_{\mathbb{R}^n} e^{i \langle t,x \rangle } dP^X(x)

Remarks

  • We have not talked explicitly about integration of complex valued functions h=f+ig where f and g are the real and imaginary parts. It is simply defined as \int (f+ig)dP = \int f dP + i \int g dP provided f and g are integrable. A complex function h is integrable if and only if |h| is integrable if and only if f and g are integrable. (The only thing a bit hard to prove is the triangle inequality |\int h dP|\le \int|h| dP.)

  • The function e^{i\langle t,x \rangle} = \cos(\langle t,x \rangle) + i \sin (\langle t,x \rangle) is integrable since \sin(\langle t,x \rangle) and \cos(\langle t,x \rangle) are bounded functions (thus in L^\infty \subset L^1), or by noting that |e^{i\langle t,x \rangle}|=1.

  • Suppose the measure P has a density f(x); then we have \widehat{P}(t)= \int e^{i\langle t,x \rangle} f(x) dx which is simply the Fourier transform (usually denoted by \widehat{f}(t)) of the function f(x). (Various conventions are used for the Fourier transform, e.g. using e^{-i 2\pi\langle k,x \rangle} instead of e^{i\langle t,x \rangle}.)

4.2 Analytic properties of the Fourier transform

We turn next to the properties of the Fourier transform. A very useful thing to remember about the Fourier transform:

\text{ the smoother the Fourier transform is, the faster the function (or the measure) decays, and vice versa.}

The next two theorems make this explicit in the context of measures. The first one is very general and simply uses the fact that we are dealing with probability measures.

Theorem 4.1 The Fourier transform \widehat{P}(t) of a probability measure P is uniformly continuous on \mathbb{R}^n, and satisfies \widehat{P}(0)=1 and |\widehat{P}(t)|\le 1.

Proof. Clearly \widehat{P}(0)= \int_{\mathbb{R}^n} dP(x)=1 since P is a probability measure and since |e^{i\langle t,x \rangle}|=1, by the triangle inequality |\widehat{P}(t)|\le 1.

For the uniform continuity we have \begin{aligned} |\widehat{P}(t+h) -\widehat{P}(t)| \le \int |e^{i\langle t+h,x \rangle} - e^{i\langle t,x \rangle} | dP(x) = \int |e^{i\langle t,x \rangle}| | e^{i\langle h,x \rangle} - 1 | dP(x) = \int | e^{i\langle h,x \rangle} - 1 | dP(x) \end{aligned} The right hand side is independent of t, which gives uniformity. To conclude we need to show that the right hand side goes to 0 as h \to 0. We can use dominated convergence since \lim_{h \to 0} e^{i\langle h,x \rangle} =1 \text{ for all } x \quad \quad \text{ and } \quad \quad | e^{i\langle h,x \rangle} - 1 | \le 2 \quad \square

As we have seen in ?@sec-inequalities the L^p spaces form a decreasing sequence L^1 \supset L^2 \supset L^3 \supset \cdots \supset L^\infty and the next theorem shows that if a random variable belongs to L^m (for some integer m) then its characteristic function will be m-times continuously differentiable.

Theorem 4.2 Suppose X is a RV taking values in \mathbb{R}^n and such that E[|X|^m] < \infty. Then the characteristic function \phi_X(t)=E[e^{i\langle t, X \rangle}] has continuous partial derivatives up to order m, and for any k \le m, \frac{\partial^k\phi_X}{ \partial{t_{i_1}}\cdots \partial{t_{i_k}} }(t) = i^k E\left[X_{i_1} \cdots X_{i_k} e^{i\langle t, X \rangle}\right]

Proof. We will prove only the m=1 case; the rest is proved by a tedious induction argument. Denoting by e_i the i-th basis vector, \begin{aligned} \frac{\partial\phi_X}{ \partial t_i}(t) = \lim_{h \to 0} \frac{\phi_X(t+he_i)- \phi_X(t)}{h} = \lim_{h \to 0} E \left[ \frac{1}{h}\left(e^{i\langle t+ he_i, X \rangle} - e^{i\langle t, X \rangle}\right) \right] = \lim_{h \to 0} E \left[ e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h} \right] \end{aligned} To exchange the limit and expectation we use a DCT argument and the bound |e^{i\alpha} -1| = \left| \int_0^\alpha \frac{d}{ds} e^{is} ds \right| \le \left|\int_0^\alpha |i e^{is}| ds \right| = |\alpha|\,. From this we see that \left|e^{i\langle t, X \rangle} \frac{e^{i h X_i}-1}{h}\right| \le |X_i| which is integrable and independent of h. The DCT concludes the proof. \quad \square

4.3 More properties

Two simple but extremely useful properties:

Theorem 4.3 If X takes values in \mathbb{R}^n, b \in \mathbb{R}^m and A is an m\times n matrix then \phi_{AX+b}(t) = e^{i \langle t, b\rangle}\phi_{X}(A^Tt)

Proof. This simply follows from the equality
e^{i \langle t, AX +b \rangle} = e^{i \langle t,b \rangle} e^{i \langle A^T t,X \rangle}\,. \quad \square


Theorem 4.4 Suppose X and Y are independent RV taking values in \mathbb{R}^n then \phi_{X+Y}(t)=\phi_{X}(t) \phi_Y(t)

Proof. By independence E\left[ e^{i \langle t,X+Y \rangle} \right] =E\left[ e^{i \langle t, X \rangle} e^{i \langle t, Y \rangle}\right] =E\left[ e^{i \langle t, X \rangle}\right]E\left[ e^{i \langle t,Y \rangle}\right] \quad \square

4.4 Examples

  1. Bernoulli with parameter p: \phi_X(t)=E[e^{itX}]= e^{it}p + e^{i0}(1-p)= (1-p) + e^{it}p\,.

  2. Binomial with parameters (n,p): using the binomial theorem \phi_X(t)=E[e^{itX}]= \sum_{k=0}^n {n \choose k} e^{itk}p^{k} (1-p)^{n-k} =(e^{it}p + (1-p))^n

  3. Poisson with parameter \lambda: \phi_X(t)=E[e^{itX}]= e^{-\lambda} \sum_{k=0}^\infty e^{itk} \frac{\lambda^k}{k!} = e^{\lambda(e^{it}-1)}

  4. Normal with parameters \mu, \sigma^2: We start with the special case \mu=0 and \sigma^2=1 and we need to compute the complex integral \phi_X(t)=E[e^{itX}]= \frac{1}{\sqrt{2\pi}}\int e^{itx} e^{-x^2/2} dx You can do it via contour integration (complete the square!). Instead we use an ODE argument.

First we note that by symmetry
\phi_X(t)= \frac{1}{\sqrt{2\pi}}\int \cos(tx) e^{-x^2/2} dx By Theorem 4.2 we can differentiate under the integral and find, after integrating by parts, \phi_X'(t)=\frac{1}{\sqrt{2\pi}}\int -x \sin(tx)e^{-x^2/2} dx = \frac{1}{\sqrt{2\pi}}\int -t \cos(tx)e^{-x^2/2} dx =-t \phi_X(t) This is a separable ODE with initial condition \phi_X(0)=1. The solution is easily found to be e^{-t^2/2} and thus we have \phi_X(t)=E\left[e^{itX}\right]= e^{-t^2/2}. Finally, noting that Y=\sigma X +\mu is normal with mean \mu and variance \sigma^2, we find by Theorem 4.3 that \phi_Y(t)=e^{i\mu t}E\left[e^{i\sigma t X}\right]= e^{i\mu t -\sigma^2t^2/2}. (An empirical check of the Poisson and normal formulas is sketched after this list.)

  5. Exponential with parameter \beta: For any (complex) z \neq 0 we have \int_a^b e^{z x} dx = \frac{e^{z b} - e^{za}}{z}. From this we deduce that \phi_X(t) = \beta \int_0^\infty e^{(it-\beta)x} dx = \frac{\beta}{\beta-it}
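
A quick empirical check of the Poisson and normal formulas above (Python with NumPy; the values of t, \lambda and the sample sizes are arbitrary): the sample average of e^{itX} should approach \phi_X(t).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.array([0.5, 1.0, 2.0])

x = rng.standard_normal(400_000)                      # X ~ N(0, 1)
print(np.mean(np.exp(1j * np.outer(t, x)), axis=1))   # empirical characteristic function
print(np.exp(-t**2 / 2))                              # exact: exp(-t^2/2)

lam = 3.0
k = rng.poisson(lam, 400_000)                         # X ~ Poisson(lam)
print(np.mean(np.exp(1j * np.outer(t, k)), axis=1))   # empirical characteristic function
print(np.exp(lam * (np.exp(1j * t) - 1.0)))           # exact: exp(lam (e^{it} - 1))
```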

4.5 Uniqueness Theorem

We show now that the Fourier transform determines the probability measure uniquely, that is, if two probability measures P and Q have the same Fourier transforms \widehat{P}=\widehat{Q} then they must coincide, P=Q. For simplicity we only consider the 1-d case but the proof extends without problem.

There exist several versions of this proof (see your textbook for one such proof). We give here a direct proof which also gives an explicit formula for how to reconstruct the measure from its Fourier transform.

Our proof relies on the following computation of the so-called Dirichlet integral

Lemma 4.1 (Dirichlet integral) For T > 0 let S(T)= \int_0^T \frac{\sin{t}}{t} \, dt. We have then \lim_{T \to \infty} S(T) = \frac{\pi}{2}

This is a fun integral to compute: it can be done using a contour integral in the complex plane, or by a Laplace transform trick, or by the so-called Feynman trick (add a parameter and differentiate). See for example the Wikipedia page.

We will take this result for granted and note that we have \int_0^T \frac{\sin(\theta t)}{t} dt = {\rm sgn}(\theta)\, S( |\theta| T) where {\rm sgn}(\theta) is +1, 0, or -1 according to whether \theta is positive, zero, or negative.

Theorem 4.5 (Fourier inversion formula) If a and b are not atoms for P we have P((a,b])= \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^T \frac{e^{-ita} - e^{-itb}}{it} \widehat{P}(t) \, dt \tag{4.1} In particular distinct probability measures cannot have the same characteristic function.

Proof. The inversion formula implies uniqueness. The collection of intervals (a,b] such that a and b are not atoms is a p-system which generates \mathcal{B}, so the monotone class theorem implies the result, see ?@thm-uniquenesspm. (See the exercises for more on atoms.)

Let I_T denote the integral in Equation 4.1. Using the bound |e^{iz}-e^{iz'}|\le |z-z'| we see that the integrand is bounded and thus by Fubini’s theorem we have I_T = \frac{1}{2\pi} \int_{-\infty}^\infty \left( \int_{-T}^T \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \right) dP(x) Using Euler's formula and the fact that \cos is even and \sin is odd we find \begin{aligned} I_T &= \int_{-\infty}^\infty \left( \int_{0}^T \frac{ \sin(t(x-a)) -\sin(t(x-b)) }{\pi t} \, dt \right) dP(x) \\ &= \int_{-\infty}^\infty \frac{{\rm sgn}(x-a)}{\pi} S(T|x-a|)-\frac{{\rm sgn}(x-b)}{\pi} S(T|x-b|) dP(x) \end{aligned} \tag{4.2}

The integrand in Equation 4.2 is bounded and converges as T \to \infty to the function \psi_{a,b}(x)= \left\{ \begin{array}{cl} 0 & x < a \\ \frac{1}{2} & x=a \\ 1 & a < x < b \\ \frac{1}{2} & x=b \\ 0 & x >b \\ \end{array} \right. By DCT we have that I_T \to \int \psi_{a,b} dP = P((a,b]) if a and b are not atoms. \quad \square

You can use the Fourier inversion formula to extract more information.

Theorem 4.6 Suppose the Fourier transform \widehat{P}(t) is integrable, \int |\widehat{P}(t)| dt < \infty. Then P has a density f(x)=\frac{1}{2\pi}\int e^{-itx} \widehat{P}(t) dt.

Proof. Using that |\frac{e^{-ita} - e^{-itb}}{it}|\le |b-a|, the fact that |\widehat{P}(t)| is integrable means we can extend the integral in Equation 4.1 to an integral from -\infty to \infty. As a consequence we get P((a,b]) \le \frac{|b-a|}{2\pi} \int_{-\infty}^\infty |\widehat{P}(t)| \, dt and thus P has no atoms. Furthermore for the CDF F(x) of P we have, for h negative or positive, \frac{F(x+h) - F(x)}{h} =\frac{1}{2\pi}\int_{-\infty}^\infty \frac{e^{-itx} - e^{-it(x+h)}}{ith} \widehat{P}(t) dt The integrand is dominated by |\widehat{P}(t)| and by DCT F is differentiable and P has density F'(x)= f(x)= \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx} \widehat{P}(t) dt. \quad \square
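
A numerical sketch of this density formula for the standard normal, whose characteristic function e^{-t^2/2} is integrable (Python with NumPy; the truncation and discretization of the t-integral are arbitrary choices):

```python
import numpy as np

t = np.linspace(-40.0, 40.0, 80_001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2.0)                      # characteristic function of N(0, 1)

for x in (0.0, 1.0, 2.0):
    # Riemann-sum approximation of (1/2pi) * integral of e^{-itx} phi(t) dt
    fx = np.real(np.sum(np.exp(-1j * t * x) * phi)) * dt / (2.0 * np.pi)
    print(x, fx, np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi))   # recovers the N(0,1) density
```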

4.6 Examples

We can use the inversion theorem in creative ways.

Examples:

  1. The Laplace RV Y is a two-sided version of the exponential RV. Its density is f(x) = \frac{\beta}{2}e^{-\beta|x|}. You can think of the Laplace distribution as the mixture (with mixture weights \frac12, \frac12) of an exponential RV X and -X where X is exponential. Its characteristic function is then \phi_Y(t)= E[e^{itY}]= \frac{1}{2}E[e^{itX}] + \frac{1}{2}E[e^{-itX}] = \frac{1}{2} \frac{\beta}{\beta -it} + \frac{1}{2}\frac{\beta}{\beta +it} =\frac{\beta^2}{\beta^2 + t^2}

  2. The Cauchy RV Z has density f(x)=\frac{\beta}{\pi(x^2 + \beta^2)} and its characteristic function is given by \phi_Z(t)= E[e^{itZ}]= \int_{-\infty}^\infty e^{itx}\frac{\beta}{\pi(x^2 + \beta^2)} dx, a priori not an easy integral. However notice that the Fourier transform of the Laplace looks (up to constants) exactly like the density of a Cauchy! So using Theorem 4.6 for the Laplace distribution shows that \frac{\beta}{2} e^{-\beta|x|} = \frac{1}{2\pi}\int e^{-itx} \phi_Y(t) dt = \frac{1}{2\pi}\int e^{-itx} \frac{\beta^2}{\beta^2 + t^2}dt = \frac{\beta}{2} \int e^{itx} \frac{\beta}{\pi(\beta^2 + t^2)}dt =\frac{\beta}{2} \phi_Z(x) from which we conclude that \phi_Z(t)= e^{-\beta|t|}.

4.7 Sum of independent random variables

Suppose X and Y are independent random variables. We wish to understand the distribution of X+Y. The first tool is to use the characteristic function and the fact that if X and Y are independent E[ e^{it(X+Y)}] = E[e^{itX}] E[e^{itY}], together with the uniqueness theorem.

Examples

  1. Suppose X is normal with parameters \mu and \sigma^2 and Y is normal with parameters \nu and \eta^2. Then if X and Y are independent, X+Y is normal with parameters \mu+\nu and \sigma^2 + \eta^2. This follows from the uniqueness theorem and E[e^{it(X+Y)} ] = E[e^{itX}] E[e^{itY}] = e^{i\mu t - \sigma^2t^2/2} e^{i\nu t - \eta^2t^2/2} = e^{i(\mu+\nu) t - (\sigma^2+\eta^2) t^2/2}

  2. Suppose X_1, \cdots, X_n are independent Bernoulli RVs with parameter p; then X_1+ \cdots + X_n is a binomial RV with parameters (n,p). Indeed we have E[e^{it(X_1+\cdots+X_n)} ]= E[e^{itX_1}] \cdots E[e^{itX_n}] =(e^{it}p +(1-p))^n

Another tool is the following convolution theorem

Theorem 4.7 (Convolution of probability measures) Assume X and Y are independent random variables.

  • If X and Y have distribution P^X and P^Y then X+Y has the distribution P^X \star P^Y (A) = \int \int 1_A(x+y) dP^X(x) dP^Y(y) \quad \text{ convolution product}
  • If X and Y have densities f_X(x) and f_Y(y) then X+Y has the density f_{X+Y}(z)= \int f_X(z-y) f_Y(y) dy = \int f_X(x) f_Y(z-x) dx

Proof. For the first part let us take a non-negative function h and set Z=X+Y. We have then E[h(Z)]= E[h(X+Y)]= \int h(x+y) P^X(dx) P^Y(dy) Taking h=1_A gives the result.

For the second part if X and Y have a density we have \begin{aligned} P(Z \in A) =E[1_A(Z)] & = \int \int 1_A(x+y) f_X(x) f_Y(y) dx dy \\ & = \int \int 1_A(z) f_X(z-y) f_Y(y) dz dy \quad \text{ change of variables } z=x+y, dz = dx \\ &= \int \left(\int f_X(z-y) f_Y(y) dy\right) 1_A(z) dz \quad \text{ Fubini } \end{aligned} and since this holds for all A, Z has the claimed density. The second formula is proved in the same way. \quad \square.

Example: triangular distribution
Suppose X and Y are independent and uniformly distributed on [-\frac{1}{2},\frac12]. Then Z=X+Y takes values between -1 and +1 and for z \in [-1,1] we have f_Z(z)= \int_{-\infty}^\infty f_X(z-y) f_Y(y) dy = \left\{ \begin{array}{cl} \int_{-\frac12}^{z+ \frac12} dy & -1\le z \le 0 \\ \int_{z-\frac{1}{2}}^{\frac12} dy & 0 \le z \le 1 \end{array} \right. = 1- |z|
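
A Monte Carlo sketch of this example (Python with NumPy; the sample size and the number of histogram bins are arbitrary): the histogram of X+Y should match the triangular density 1-|z|.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
z = rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)   # sum of two independent U[-1/2, 1/2]

hist, edges = np.histogram(z, bins=40, range=(-1.0, 1.0), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - (1.0 - np.abs(mid)))))           # small: histogram ~ f_Z(z) = 1 - |z|
```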

4.8 Moment and moment generating functions

The moment problem is the question whether a probability distribution P is uniquely determined by all its moments E[X^n]. This is in general not true, as the following example shows.

  • Recall that the log-normal distribution is the distribution of e^X where X has a normal distribution. Its pdf is given, for \mu=0 and \sigma^2=1, by f(x)= \frac{1}{x \sqrt{2 \pi}} e^{- \frac{\ln(x)^2}{2}} and all moments exist: \int_0^\infty x^k f(x) dx = e^{k^2/2}.

  • Now consider g(x) = f(x)(1 + \sin(2 \pi \ln(x))). Then for k=0,1,2, \cdots we have, with the change of variable \ln(x)= s+k, \int_0^\infty x^k f(x) \sin(2 \pi \ln(x)) dx = \frac{1}{\sqrt{2 \pi}} e^{k^2/2}\int_{-\infty}^\infty e^{-s^2/2} \sin(2 \pi s) ds =0 \,. This shows that g is the density of a RV Y and that all moments of Y coincide with the moments of the log-normal! (A numerical check is sketched below.)
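
A numerical check of this computation (Python with SciPy; after the substitution s = \ln x the k-th moments of both densities reduce to Gaussian-type integrals, evaluated here for the first few k):

```python
import numpy as np
from scipy.integrate import quad

# after s = ln(x): k-th moment of f is (2 pi)^{-1/2} * integral of exp(k s - s^2/2) ds,
#                  k-th moment of g has the extra factor (1 + sin(2 pi s)) in the integrand
for k in range(4):
    mf, _ = quad(lambda s: np.exp(k * s - s**2 / 2) / np.sqrt(2 * np.pi),
                 -np.inf, np.inf)
    mg, _ = quad(lambda s: np.exp(k * s - s**2 / 2) * (1 + np.sin(2 * np.pi * s)) / np.sqrt(2 * np.pi),
                 -np.inf, np.inf)
    print(k, mf, mg, np.exp(k**2 / 2))   # the moments of f and g coincide and equal e^{k^2/2}
```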

A stronger condition on the moments does imply uniqueness: if all moments exist and E[X^n] does not grow too fast with n, then the moments do determine the distribution. This uses an analytic continuation argument and relies on the uniqueness theorem for the Fourier transform.

Theorem 4.8 Suppose X and Y are RVs whose moment generating functions satisfy M_X(t)=M_Y(t) for t \in [-t_0,t_0] and are finite on that interval. Then X and Y have the same distribution.

Proof.

  • Since e^{t|x|} \le e^{tx}+ e^{-tx} and the right hand side is integrable, the function e^{t|X|}=\sum_{k=0}^\infty \frac{t^k|X|^k}{k!} is integrable. By the DCT (for sums of RVs, in the form of ?@exr-62) we have E[e^{tX}]=\sum_{k=0}^\infty \frac{t^k E[X^k]}{k!}.

  • This implies that \frac{t^k E[X^k]}{k!} \to 0 as k \to \infty (for |t|\le t_0). We claim that this implies that \frac{s^k E[|X|^k]}{k!}\to 0 as k \to \infty as long as s < t_0. If k is even then E[|X|^k]=E[X^k] and there is nothing to do. For odd exponents we use on one hand that |X|^{2k-1} \le 1 + |X|^{2k}, and on the other hand that for s< t we have (for k sufficiently large) 2k\, s^{2k-1} < t^{2k}. Together this shows \frac{s^{2k-1} E[|X|^{2k-1}]}{(2k-1)!} \le \frac{s^{2k-1}}{(2k-1)!} + \frac{t^{2k} E[X^{2k}]}{(2k)!} \quad \text{ for $k$ large enough,} and both terms on the right go to 0.

  • The next piece is the Taylor expansion theorem with remainder for functions which are (n+1)-times continuously differentiable, f(x)= \sum_{k=0}^n f^{(k)}(x_0) \frac{(x-x_0)^k}{k!} + \int_{x_0}^x \frac{f^{(n+1)}(s)}{n!}(x-s)^n ds, from which we obtain
    \left|e^{itx}\left( e^{ihx} - \sum_{k=0}^n \frac{(ihx)^k}{k!}\right)\right| \le \frac{|h x|^{n+1}}{(n+1)!}

  • Integrating with respect to P and taking n \to \infty, together with Theorem 4.2, gives \phi_X(t+h)= \sum_{k=0}^\infty \phi_X^{(k)}(t) \frac{h^k}{k!} \tag{4.3} for |h| < t_0.

  • Now suppose X and Y have the same moment generating function in a neighborhood of 0. Then all their moments coincide and thus, by Equation 4.3 (with t=0), \phi_X(t)=\phi_Y(t) on the interval (-t_0, t_0). By Theorem 4.2 their derivatives must also be equal on (-t_0,t_0). Using now Equation 4.3 (with t=-t_0+\epsilon and t=t_0-\epsilon) shows that \phi_X(t)=\phi_Y(t) on the interval (-2t_0+2\epsilon, 2t_0-2\epsilon), and since \epsilon is arbitrary, on (-2t_0, 2t_0). Repeating this argument, \phi_X(t)=\phi_Y(t) for all t and thus by Theorem 4.5 X and Y must have the same distribution. \quad \square

4.9 Exercises

Exercise 4.1  

  • Show that a characteristic function \phi_X(t) satisfies \overline{\phi_X(t)}=\phi_X(-t) (complex conjugate).

  • Show that a characteristic function \phi_X(t) is real if and only if the random variable X is symmetric (i.e X and -X have the same distribution)

  • Show that if \phi_X is the characteristic function for some RV X then \phi_X^2(t) and |\phi_X(t)|^2 are characteristic functions as well. What are the corresponding RVs?

Exercise 4.2 In this problem we study the characteristic function of a Gamma random variable with parameters \alpha and \beta and density \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}. In particular you will prove that \phi_X(t)=\left(\frac{\beta}{\beta-it}\right)^\alpha.

  • First show that it is enough to consider the case \beta=1 (change scale.)

  • Use the moments E[X^n] to show that \phi_X(t)= \sum_{n=0}^\infty (it)^n \frac{\Gamma(\alpha+n)}{\Gamma(\alpha) n!} and use then the binomial series for (1-it)^{-\alpha}.

  • Use your result to show that

    • The sum of two independent gamma random variables with parameters (\alpha_1,\beta) and (\alpha_2,\beta) is a gamma random variable.

    • If X_1, X_2, \cdots, X_n are independent normal random variables with mean 0 and variance \sigma^2, show that X_1^2 + \cdots + X_n^2 is a gamma random variable and find the parameters.

Exercise 4.3 Show that if X and Y are RVs taking values in the non-negative integers with distributions P^X(n) and P^Y(n) and are independent, then X+Y has distribution P^{X+Y}(n)=\sum_{k=0}^n P^X(k) P^Y(n-k) (this is called the convolution product of the two sequences P^X and P^Y).

Exercise 4.4  

  • In Theorem 4.5, modify the statement of the theorem if a or b are atoms.

  • Show that P(\{a\}) = \lim_{T \to \infty} \frac{1}{2T}\int_{-T}^T e^{-ita} \widehat{P}(t) dt Hint: Imitate the proof of the inversion formula in Theorem 4.5

  • Suppose a RV X takes integer values in \mathbb{Z}, show that P(X=n)= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-itn} \phi_X(t) dt Hint: Show that \phi_X(t) is periodic