Math 606, Spring 2024
University of Massachusetts Amherst
2024-05-07
Martingales are models of fair games, and to understand them we first need to understand conditional expectations. Conditional expectation is a very useful concept for understanding how information obtained from measurements can be incorporated to make predictions.
Suppose we are given a random variable Y. If we know nothing about the outcome of the experiment generating Y then our best guess for the value of Y would be the expectation E[Y]. On the contrary, if we measure Y itself then our prediction would be Y itself! Conditional expectation deals with making the best guess on the possible value of Y when we have some partial information, described by some collection of other random variables X_1, X_2, \cdots, X_n.
Example: discrete RV: Suppose X and Y are discrete random variables with joint density and marginals \text{ joint pdf } \quad p(x,y)= P(X=x,Y=y) \quad \text{marginals } \quad p(x) =\sum_{y} p(x,y)\,,\, \quad p(y) =\sum_{x} p(x,y) To define the conditional expectation E[Y|X] we need to give the best guess for Y given that we have observed X=x which is E[Y|X=x] = \sum_{y} y P(Y=y| X=x) = \sum_{y} y\, \frac{P(Y=y, X=x)}{P(X=x)} = \sum_{y} y \,\frac{p(x,y)}{p(x)} which is well defined for those x with p(x)>0.
The function E[Y|X=x] defines a function of the random variable X which we define to be E[Y|X]. For example if we roll two independent dice and X is the value of the first roll and Y the sum of the two rolls
then we have
p(x,y) = \frac{1}{36}\,, \quad x=1,2,\cdots, 6 \quad y= x+1, \cdots, x+6.
and thus
E[Y|X=x] = x + \frac{7}{2}
so that E[Y|X]=X + \frac{7}{2}.
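As a quick numerical sanity check of the two-dice computation above (a sketch, not part of the original notes; it assumes numpy is available and the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(1, 7, size=n)        # first die
Y = X + rng.integers(1, 7, size=n)    # sum of the two dice

# empirical E[Y | X = x] versus the formula x + 7/2
for x in range(1, 7):
    print(x, Y[X == x].mean(), x + 3.5)
```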
In a similar way we can define E[Y|X_1,\cdots, X_n] for discrete RV with joint pdf p(x_1, \cdots, x_n, y).
Example: continuous RV: In a similar way, if Y,X_1, \cdots X_n are continuous RV with a joint pdf f(x_1, \cdots, x_n,y) with marginal f(x_1, \cdots,x_n)=\int f(x_1, \cdots, x_n,y) dy then the function y \mapsto \frac{f(x_1, \cdots, x_n,y)}{f(x_1, \cdots,x_n)} defines a probability density function provided (x_1, \cdots,x_n) is such that f(x_1, \cdots,x_n)\not=0. Then the expectation E[Y|X_1=x_1 \cdots X_n=x_n] = \int y \frac{f(x_1, \cdots, x_n,y)}{f(x_1, \cdots,x_n)} \, dy defines a function of (x_1, \cdots,x_n). We leave this function undefined whenever f(x_1, \cdots,x_n)=0 (or set it to 0 if you prefer).
Any function h(x_1,\cdots,x_n) can be used to define a RV h(X_1, \cdots,X_n). Note that, as a RV, this
does not depend on how the function is defined when f(x_1, \cdots,x_n)=0 since such x have probability 0.
We denote the corresponding random variable by E[Y|X_1,\cdots, X_n] and call it the conditional expectation of Y given X_1,\cdots, X_n.
The conditional expectation has the following properties
E[Y|X_1, \cdots, X_n] depends only on X_1, \cdots, X_n in the sense that it is a function h(X_1,\cdots,X_n). In the language of measure theory E[Y|X_1,\cdots,X_n] is a measurable function with respect to X_1, \cdots, X_n, or better with respect to the \sigma-algebra generated by X_1, \cdots, X_n.
Suppose that A is an event which depends only on X_1, \cdots, X_n, for example the rectangle A = \left\{ a_i \le X_i \le b_i \,,\, i=1, \cdots,n \right\} and let 1_A be the corresponding indicator function. Then E[ Y 1_A ] = E[ E[Y|X_1,\cdots, X_n] 1_A ]
To prove the second property note that \begin{aligned} E[ E[Y|X_1,\cdots, X_n] 1_A ] & = \int E[Y|X_1=x_1, \cdots, X_n=x_n] 1_A(x_1, \cdots, x_n) f(x_1, \cdots, x_n) dx_1 \cdots dx_n \\ & = \int_A \left(\int y \frac{f(x_1, \cdots, x_n, y)}{f(x_1, \cdots, x_n)} dy \right) f(x_1, \cdots, x_n) dx_1 \cdots dx_n \\ & = \int_A \int y f(x_1, \cdots, x_n, y) dy\, dx_1 \cdots dx_n \\ &= E[ Y 1_A ] \end{aligned}
What we have demonstrated with examples are instances of a general theorem. We use the notation \mathcal{F}_n to denote all the information contained in the random variables X_1, \cdots, X_n. We say that a random variable Z is [measurable with respect to \mathcal{F}_n] if Z=h(X_1, \cdots, X_n) can be expressed as a function of (X_1, \cdots, X_n). We say that a set A is measurable with respect to \mathcal{F}_n if the function 1_A is measurable with respect to \mathcal{F}_n. This simply means that A should be specified using the random variables X_1, \cdots, X_n.
Theorem 1.1 Let Y and X_1, \cdots, X_n be random variables on a probability space (\Omega, \mathcal{F}, \mathcal{P}) and assume that E[|Y|]<\infty. Let \mathcal{F}_n be the \sigma-algebra generated by X_1, \cdots, X_n. Then there exists a unique random variable E[Y|X_1, \cdots, X_n] such that
E[Y|X_1, \cdots, X_n] is measurable with respect to \mathcal{F}_n.
For any A measurable with respect to \mathcal{F}_n we have E[ Y 1_A ] = E[ E[Y|X_1, \cdots, X_n] 1_A ] \quad \text{ for all } A \in \mathcal{F}_n \,.
From now on we use the abbreviated notation for the conditional expectation E[Y| \mathcal{F}_n] = E[Y|X_1, \cdots, X_n]
Theorem 1.2 The conditional expectation has the following properties
Linearity: E[a_1Y_1 + a_2Y_2| \mathcal{F}_n] =a_1E[Y_1| \mathcal{F}_n] + a_2 E[Y_2| \mathcal{F}_n]
If Y=g(X_1,\cdots,X_n) then E[Y|\mathcal{F}_n] = Y
If Y is independent of X_1,\cdots,X_n then E[Y|\mathcal{F}_n] = E[Y]
If m < n then E[ E[Y|\mathcal{F}_n]|\mathcal{F}_m] = E[Y|\mathcal{F}_m].
If Z=g(X_1,\cdots,X_n) then E[Y Z|\mathcal{F}_n] = Z E[Y|\mathcal{F}_n].
Proof. The idea is to use the uniqueness statement in Theorem 1.1.
For part 1. E[Y_i| \mathcal{F}_n] are the unique \mathcal{F}_n measurable random variables such that E[Y_i 1_A] = E[ E[Y_i| \mathcal{F}_n] 1_A] and so by linearity a_1 E[ E[Y_1| \mathcal{F}_n] 1_A] + a_2 E[ E[Y_2| \mathcal{F}_n] 1_A] = a_1 E[Y_1 1_A]+ a_2 E[Y_2 1_A] = E[(a_1 Y_1 +a_2Y_2) 1_A] and by uniqueness we must have E[a_1Y_1 + a_2Y_2| \mathcal{F}_n] =a_1E[Y_1| \mathcal{F}_n] + a_2 E[Y_2| \mathcal{F}_n].
For part 2. if Y=g(X_1,\cdots,X_n) then Y itself satisfies the definition.
For part 3. if Y is independent of X_1,\cdots,X_n then by independence and linearity E[Y 1_A] = E[Y] E[1_A] = E[ E[Y] 1_A] which, by uniqueness, proves the statement.
For part 4. note that E[ E[Y|\mathcal{F}_n]|\mathcal{F}_m] and E[Y|\mathcal{F}_m] both depend only on X_1, \cdots X_m. Moreover if A is \mathcal{F}_m measurable and m \le n then it is also \mathcal{F}_n measurable. So we have E[ E[Y|\mathcal{F}_m] 1_A] = E[ Y 1_A ] = E[ E[Y|\mathcal{F}_n] 1_A] = E[ E[E[Y|\mathcal{F}_n]|\mathcal{F}_m] 1_A ] which, by uniqueness, proves the statement.
Finally for 5. if Z=1_B and B is \mathcal{F}_n measurable then E[ E[Y 1_B|\mathcal{F}_n] 1_A ] =E[ Y 1_B 1_A ]= E[ Y 1_{A\cap B} ] = E[E[Y |\mathcal{F}_n] 1_B 1_A] which proves the statement. For general Z one uses an approximation argument by simple functions.
Example Suppose X_i are IID random variables with \mu=E[X_i] and let S_n= X_1 + \cdots + X_n. If we take m < n then we have \begin{aligned} E[S_n|\mathcal{F}_m] & = E[ X_1+ \cdots + X_m|\mathcal{F}_m ] + E[ X_{m+1}+ \cdots + X_n|\mathcal{F}_m ] \\ & = X_1+ \cdots + X_m + E[ X_{m+1}+ \cdots + X_n] = S_m + (n-m)\mu \end{aligned} since X_1 + \cdots + X_m is \mathcal{F}_m measurable and X_{m+1} + \cdots + X_n is independent of X_1, X_2, \cdots, X_m (see Theorem 1.2, properties 2 and 3.)
Example Let S_n be as in the previous example and assume that \mu=0 and let \sigma^2=V(X_i) be the variance. If we take m < n we find, using properties 2. 3. and 5. of Theorem 1.2. \begin{aligned} E[S_n^2|\mathcal{F}_m] & = E[ (S_m + (S_n-S_m))^2|\mathcal{F}_m ] \\ &= E[ S_m^2 |\mathcal{F}_m ] +2 E[ (S_n-S_m) S_m | \mathcal{F}_m ] + E[(S_n-S_m)^2|\mathcal{F}_m ] \\ & = S_m^2 + 2 S_m E[ (S_n-S_m)| \mathcal{F}_m ] + E[(S_n-S_m)^2] \\ & = S_m^2 +2 S_m E[ S_n-S_m] + E[(S_n-S_m)^2] \\ & = S_m^2 + V(S_n-S_m) =S_m^2 +(n-m)\sigma^2 \end{aligned}
Example If X_i are Bernoulli RV and m <n let us compute E[S_m|S_n]. Note first that P(X_1=1| S_n=k) =\frac{p {n-1 \choose k-1} p^{k-1}(1-p)^{n-k}}{{n\choose k}p^k (1-p)^{n-k}} = \frac{k}{n} \implies E[X_1|S_n] = \frac{S_n}{n} and thus E[S_m|S_n] = \frac{m}{n} S_n \,.
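A short simulation (a sketch, not from the notes; the parameters p, m, n are arbitrary) illustrating E[S_m|S_n] = \frac{m}{n} S_n for Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n, trials = 0.3, 4, 10, 500_000
X = rng.random((trials, n)) < p               # Bernoulli(p) variables, one row per experiment
S_m, S_n = X[:, :m].sum(axis=1), X.sum(axis=1)

# empirical E[S_m | S_n = k] versus (m/n) k
for k in range(n + 1):
    sel = S_n == k
    if sel.sum() > 1000:                      # only report well-sampled values of k
        print(k, round(S_m[sel].mean(), 3), m * k / n)
```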
Exercise 1.1 (Conditional expectation as a projection)
Show the following: if E[|Y|^2]< \infty (which ensures that all expectations exist) then E[Y|\mathcal{F}_n] is the random variable measurable with respect to \mathcal{F}_n which minimizes the mean square error, that is it solves \min \{ E[(Y - Z)^2] \,:\, Z \,\,\mathcal{F}_n-\text{measurable} \} Hint: Write Z as Z=E[Y|\mathcal{F}_n]+ W and expand the square.
Suppose E[Y^2]< \infty and m \le n. Show that E[ (Y-E[Y|\mathcal{F}_n])^2 ] + E[ (E[Y|\mathcal{F}_n]- E[Y|\mathcal{F}_m])^2 ] = E[ (Y-E[Y|\mathcal{F}_m])^2 ] in particular E[ (Y-E[Y|\mathcal{F}_n])^2 ] is a decreasing function of n.
Suppose E[Y^2]< \infty and define the conditional variance to be the random variable V(Y|\mathcal{F}_n)= E[Y^2|\mathcal{F}_n] - E[Y|\mathcal{F}_n]^2. Show that V(Y)= E[ V(Y|\mathcal{F}_n)] + V (E[Y|\mathcal{F}_n])
Definition 2.1 Consider a collection of random variables X_1, X_2, \cdots. A sequence of random variables M_0, M_1, M_2, \cdots is called a Martingale with respect to the filtration \mathcal{F}_n if
E[|M_n|]< \infty for all n.
M_n is measurable with respect to \mathcal{F}_n
For each m < n we have E[M_n|\mathcal{F}_m]=M_m
Remark To verify the martingale property 3. in Definition 2.1 it is enough to check that E[M_{n+1}|\mathcal{F}_n]=M_n \text{ for all } n since this property implies that, by item 4. in Theorem 1.2, E[M_{n+2}|\mathcal{F}_n]= E[E[M_{n+2}|\mathcal{F}_{n+1}]|\mathcal{F}_n] = E[M_{n+1}|\mathcal{F}_n]=M_n and so on.
Example One of the prototypes of a martingale is given by sums of IID random variables. Let
S_0=0\,, \quad S_n=X_1 + \cdots + X_n
As we have seen before, if m< n,
E[S_n|\mathcal{F}_m]= S_m +(n-m)\mu
This implies that M_n= S_n-n\mu is a martingale.
Example Suppose Y is a random variable with E[|Y|]< \infty. Then we can build a martingale with respect to \mathcal{F}_n by setting M_n = E[Y|\mathcal{F}_n] Indeed we have, for m \le n, E[M_n|\mathcal{F}_m] = E[E[Y|\mathcal{F}_n]|\mathcal{F}_m]= E[Y|\mathcal{F}_m]=M_m In that case we say that the martingale is closed by the random variable Y and we can think of M_n as successively “better” approximations of Y as we incorporate more and more information.
Suppose we are playing a sequence of independent fair games with two outcomes (e.g. betting on the flip of a fair coin.) We describe this by RVs X_i such that P(X_i=+1) = P(X_i=-1)= \frac{1}{2} The RV X_i describes the winnings obtained by betting an amount of 1 on the i^{th} game, and the game is fair since E[X_i]=0.
A betting sequence is a sequence of RV B_n such that
B_n is the amount of money bet on the n^{th} game.
B_n is measurable with respect to \mathcal{F_{n-1}}.
E[|B_n|]<\infty
The second property means that the way you bet on the n^{th} game is guided by the past outcomes of the n-1 previous bets. No peeking into the future allowed! The winnings after n games are given by W_n = \sum_{k=1}^n B_k X_k \quad \text{with } W_0=0 and W_n is a martingale.
Clearly W_n is \mathcal{F}_n measurable and E[|W_n|]< \infty. Moreover we have \begin{aligned} E[W_{n+1}|\mathcal{F}_n]& = E[B_{n+1} X_{n+1}| \mathcal{F}_{n}] + E[W_n| \mathcal{F}_{n}] = B_{n+1} E[X_{n+1}| \mathcal{F}_{n}] + W_n = W_n \end{aligned} where we have used that B_{n+1} is \mathcal{F}_n measurable and E[X_{n+1}]=0.
The martingale property implies that E[W_n] is constant: we have E[W_n] = E[ E[W_n|\mathcal{F}_{n-1}]] =E[W_{n-1}] which means that your expected winnings in a fair game are zero.
But this is not the end of the story. In a betting strategy you will make a sequence of bets and decide on a good moment to actually stop betting. Consider for example the following well known strategy (often called the martingale strategy): double your bet until you win. If you win the first game you stop and take your gain of W_1=1. If you lose the first game you bet 2 on the second game. If you win the second game your winnings are W_2=-1+2 =1 and you then stop. If you lose the first two games you now bet 4 on the third game and if you win the third game your winnings are W_3=-1-2+4=1, and so on… We have then the transition probabilities P(W_{n+1}=1|W_n=1)=1 and P(W_{n+1}=1|W_n=-(2^n-1))=\frac{1}{2} P(W_{n+1}=-(2^{n+1}-1)|W_n=-(2^n-1))=\frac{1}{2} and this is a martingale. It is true that E[W_n]=0, but when you stop, which happens at a random time T, you always win $1! The time T at which you first win has here a geometric distribution. We will consider stopping times in the next section.
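The following small simulation (an illustrative sketch, not part of the notes; it assumes numpy and caps the number of rounds at a value essentially never reached) shows that with the doubling strategy W_T is always 1 while T has mean 2:

```python
import numpy as np

rng = np.random.default_rng(2)

def play_doubling(max_rounds=60):
    """Double the stake until the first win; return (final winnings, stopping time)."""
    wealth, bet = 0, 1
    for t in range(1, max_rounds + 1):
        if rng.random() < 0.5:        # win the t-th fair bet
            return wealth + bet, t
        wealth -= bet                 # lose the stake
        bet *= 2                      # double the bet for the next round
    return wealth, max_rounds         # essentially never reached

results = [play_doubling() for _ in range(100_000)]
wins = np.array([w for w, _ in results])
times = np.array([t for _, t in results])
print("W_T is always 1:", bool(np.all(wins == 1)))
print("average stopping time (should be close to 2):", times.mean())
```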
Consider an urn with balls of two colors, say red and green. Assume that initially there is one ball of each color in the urn. At each time step, a ball is chosen at random from the urn. If a red ball is chosen, it is returned and in addition another red ball is added to the urn. Similarly, if a green ball is chosen, it is returned together with another green ball.
Let X_n denote the number of red balls in the urn after n draws. Then X_0=1 and X_n is a (time-inhomogeneous) Markov chain with transitions
P(X_{n+1}=k+1| X_n=k) = \frac{k}{n+2} \quad \quad P(X_{n+1}=k| X_n=k) = \frac{n+2 -k}{n+2}
We now define
M_n= \frac{X_n}{n+2} \quad \text{ fraction of red balls after $n$ draws}
Then M_n is a martingale since
E[X_{n+1}|X_n=k]= (k+1)\frac{k}{n+2} + k \frac{n+2 -k}{n+2} = \frac{k}{n+2} + k \implies E[X_{n+1}|X_n] = X_n + \frac{X_n}{n+2}.
Since this is a Markov chain,
E[M_{n+1}|\mathcal{F}_n]=E[M_{n+1}|X_n]= E\left[\frac{X_{n+1}}{n+3}\Big|X_n\right]= \frac{1}{n+3}\left(X_n + \frac{X_n}{n+2}\right) =\frac{X_n}{n+2}=M_n
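Here is a minimal simulation of the Pólya urn (a sketch, not from the notes; the number of draws and trials are arbitrary) checking that E[M_n] stays equal to M_0=\frac{1}{2}:

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, trials = 50, 20_000
M = np.empty((trials, n_steps + 1))

for t in range(trials):
    red, total = 1, 2                       # one red and one green ball initially
    M[t, 0] = red / total
    for n in range(1, n_steps + 1):
        if rng.random() < red / total:      # a red ball is drawn and another red ball added
            red += 1
        total += 1
        M[t, n] = red / total               # M_n = X_n / (n + 2)

# E[M_n] should stay (approximately) equal to 1/2 for all n
print(M.mean(axis=0)[[0, 1, 5, 10, 25, 50]])
```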
There is a natural connection between Markov chains and martingales. We explain this here in the context of continuous-time Markov chains, but this is just an example. Consider a function f:S \to \mathbb{R} and let us derive an equation for E[f(X_t)]. Using the Kolmogorov equation we have \frac{d}{dt}E[f(X_t)] = \frac{d}{dt} \sum_{i} f(i) p_t(i) = \sum_{i} f(i) (p_t A)(i) = \sum_{i} f(i) \sum_{j} p_t(j) A(j,i) = \sum_{j} \sum_{i} A(j,i) f(i) p_t(j) and thus \frac{d}{dt}E[f(X_t)] = E[ g(X_t)] \quad \text{ where } g = Af Integrating the previous equation we find E[f(X_t)] - E[f(X_0)] = \int_0^t E[ Af(X_s)] ds
There is a martingale hidden in this equation. Indeed consider the random variables Y_t = f(X_t) - f(X_0) - \int_0^t Af(X_s) \, ds Our previous calculation shows that E[Y_t]=0, moreover by the Markov property we have E[Y_{t+s} -Y_t | \mathcal{F}_t ] = E[ f(X_{t+s}) - f(X_t) - \int_t^{t+s} Af(X_u) du | X_t ] =0
Exercise 2.1
Suppose X_1, X_2, \cdots are IID random variables with E[X_i]=1. Show that M_n = X_1 X_2 \cdots X_n is a martingale.
Suppose the RV X has pdf f(x) and the RV Y has pdf g(x). Show that M_n= \prod_{i=1}^n \frac{g(X_i)}{f(X_i)} is a martingale with respect to X_1, X_2, \cdots where the X_i are IID with pdf f(x). This martingale is called the likelihood ratio martingale.
Suppose X_1, X_2, \cdots are IID random variables with moment generating function M(t)=E[e^{tX_i}] and let S_n=X_1 + \cdots + X_n. Show that M_n = \frac{e^{tS_n}}{M(t)^n} is a martingale.
Exercise 2.2 Suppose N_t is a Poisson process with rate \lambda. Show that N_t-\lambda t is a martingale.
Exercise 2.3 Consider a branching process X_n with mean offspring number \mu and extinction probability a. Show that
M_n = X_n \mu^{-n} is a martingale.
M_n = a^{X_n} is a martingale.
Exercise 2.4 Show (by induction) that for Pólya’s urn we have P(X_n=k) = \frac{1}{n+1} \text{ for } k=1, 2, \cdots, n+1 Show that M_n converges to M_\infty in distribution. Find M_\infty.
A stopping time T with respect to a sequence of random variables X_0, X_1, \cdots should be such that the random time at which you decide to stop depends only on the information you have accumulated so far. If you decide to stop at time n then it should depend only on X_0, \cdots, X_n and you are not allowed to peek into the future.
Definition 3.1 A stopping time T with respect to the filtration \mathcal{F}_n is a random variable taking values in \{0,1,2,\cdots, +\infty\} such that for any n the event \{T=n\} is measurable with respect to \mathcal{F}_n.
Example: T=k is a stopping time.
Example: The hitting time T_A =\inf\{j, X_j \in A \}, for some set A, is a stopping time.
Example: If T and S are stopping times then \min\{S,T\} is also a stopping time. In particular T_n = \min\{T,n\} is a bounded stopping time since T_n \le n and we have T_0 \le T_1 \le \cdots \le T_n \le T.
The optional sampling theorem says, roughly speaking, that “you cannot beat a fair game”, which means that if M_n is a martingale and T is a stopping time then E[M_T]=E[M_0]. This is not true in general, as we have seen from the martingale betting system where 1=E[M_T]\not= E[M_0]=0. We start with the following result
Theorem 3.1 (Optional Sampling Theorem (T bounded)) Suppose M_n is a martingale and T is a bounded stopping time (i.e. T \le K) then E[M_T|\mathcal{F}_0]=M_0 and in particular E[M_T]=E[M_0].
Proof. Since T\le K we can write M_{T}=\sum_{j=0}^K M_j 1_{\{T=j\}} We compute next E[M_T|\mathcal{F}_{K-1}]. Note that since T is bounded by K we have 1_{\{T=K\}}= 1_{\{T> K-1\}} which is measurable with respect to \mathcal{F}_{K-1}. Therefore we find
\begin{aligned} E[M_{T}|\mathcal{F}_{K-1}] & = E[ M_K 1_{\{T> K-1\}} | \mathcal{F}_{K-1}] + \sum_{j=0}^{K-1} E[ M_j 1_{\{T=j\}}|\mathcal{F}_{K-1}] \\ & = 1_{\{T> K-1\}} E[ M_K | \mathcal{F}_{K-1}] + \sum_{j=0}^{K-1} M_j 1_{\{T=j\}} = 1_{\{T> K-1\}} M_{K-1} + \sum_{j=0}^{K-1} M_j 1_{\{T=j\}} \\ & = 1_{\{T> K-2\}} M_{K-1} + \sum_{j=0}^{K-2} M_j 1_{\{T=j\}} \end{aligned} where we used that M_j 1_{\{T=j\}} is \mathcal{F}_{K-1} measurable if j\le K-1.
Using this we find \begin{aligned} E[M_{T}|\mathcal{F}_{K-2}]&=E[ E[M_{T}|\mathcal{F}_{K-1}]|\mathcal{F}_{K-2}]\\ & = E[ 1_{\{T> K-2\}} M_{K-1}|\mathcal{F}_{K-2}]+ \sum_{j=0}^{K-2} E[M_j 1_{\{T=j\}}|\mathcal{F}_{K-2}] \\ &= 1_{\{T> K-3\}} M_{K-2} + \sum_{j=0}^{K-3} M_j 1_{\{T=j\}} \end{aligned} and thus, inductively, E[M_T|\mathcal{F}_0]= M_0 \quad \blacksquare
To prove a more general version let us assume that P(T < \infty)=1. Then T_n=\min\{T,n\} converges to T and we can write M_{T} = M_{T_n} 1_{\{T\le n\}} + M_{T} 1_{\{T > n\}} = M_{T_n} + M_{T} 1_{\{T >n \}} - M_n 1_{\{T >n \}}
Since T_n is bounded, the optional sampling theorem gives E[M_{T_n}]=E[M_0]. But we then need to control the remaining two terms.
If we assume that M_T is integrable, E[|M_T|]< \infty, the assumption P(T < \infty)=1 means that 1_{\{T>n\}} converges to 0 and thus by the dominated convergence theorem we have \lim_{n \to \infty} E[M_{T} 1_{\{T >n \}}]=0.
The third term is more troublesome. Indeed for the martingale betting system, if T > n it means we lost the first n bets, which happens with probability \frac{1}{2^n}, for a total loss of -1 -2 - \cdots - 2^{n-1}=-(2^n-1). Therefore E[M_n 1_{\{T> n\}}] = \frac{1}{2^n}(1-2^n) \to -1 \text{ as } n \to \infty It does not converge to 0 but to -1; together with E[M_{T_n}]=0 and E[M_T 1_{\{T>n\}}]\to 0 this gives E[M_T]= 0 + 0 -(-1)=1, in accordance with the result E[M_T]=1.
These considerations lead to the following
Theorem 3.2 (Optional Sampling Theorem (general version)) Suppose M_n is a martingale and T is a finite stopping time (i.e. P(T < \infty)=1). If E[|M_T|]<\infty and \lim_{n \to \infty}E[ |M_n| 1_{\{T>n\}}] =0 then E[M_T]=E[M_0].
Remark If the sequence M_n is uniformly bounded, i.e. |M_n| \le C then the optional sampling theorem holds since
E[ |M_n| 1_{\{T>n\}}] \le C P(T >n).
Remark Another condition which guarantees the optional sampling theorem is \sup_{n}E[M_n^2] \le C < \infty. Indeed if this holds, given \epsilon>0 we have \begin{aligned} E[|M_n| 1_{T>n}] & = E[|M_n| 1_{|M_n| > C/\epsilon} 1_{T>n}] + E[|M_n| 1_{|M_n| \le C/\epsilon} 1_{T>n}] \\ & \le \frac{\epsilon}{C} E[|M_n|^2 1_{\{T>n\}} ] + \frac{C}{\epsilon}P(T >n) \le \epsilon + \frac{C}{\epsilon}P(T >n) \end{aligned} Taking n \to \infty shows that \lim_{n} E[|M_n| 1_{\{T>n\}}] \le \epsilon.
Example: Gambler’s ruin probability. Define S_0=a and S_n = a+ X_1 + \cdots + X_n where the X_i are IID fair bets P(X_i=-1)=P(X_i=1)=\frac{1}{2}.
Then S_n is a martingale. Consider the stopping time
T = \min \{n :S_n =0 \text{ or } S_n=N \}
which describes the time at which a gambler starting with a fortune a either goes bankrupt or reaches a fortune of N. Note that if T>n then 0 < S_n < N and thus E[|S_n| 1_{\{T> n\}}] \le N P(T >n) \to 0 as n \to \infty.
Therefore the optional sampling theorem applies and E[S_T] = E[S_0]=a.
Since S_T takes only the values 0 and N we then get
a= E[S_T]= N P(S_T=N) \quad \implies P(S_T=N)= \frac{a}{N}
This gives another (computation free) derivation of the gambler’s ruin formula! The case p\not= \frac{1}{2} can also be treated using a (different) martingale and will be considered in the homework.
Example: Gambler’s ruin playing time. Suppose S_n is as in the previous example. Then M_n=S_n^2 -n is also a martingale since
E[S_{n+1}^2- (n+1) | \mathcal{F}_n]= E[S_n^2 + 2X_{n+1}S_n + X_{n+1}^2- (n+1) | \mathcal{F}_n] = S_n^2 + 2 S_n E[X_{n+1}] + 1 - (n+1) = S_n^2 -n
To apply the optional sampling theorem we note that P(T>n) \le C \rho^n for some C>0 and \rho<1, since T is a hitting time for a finite state Markov chain.
Therefore we have E[|M_n| 1_{\{T> n\}}] \le (N^2 + n) C \rho^n \to 0 and we can apply the optional sampling theorem to get
E[M_T] =E[M_0]=a^2
But, using the previous example we find
a^2= E[M_T] = E[S_T^2] - E[T] = N^2 P(S_T=N) - E[T] = aN - E[T]
and thus
E[T]= a(N-a)
which gives the expected length of play.
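Both formulas are easy to check by simulation. The sketch below (not from the notes; the values of a and N are arbitrary) estimates P(S_T=N) and E[T] for a fair walk started at a:

```python
import numpy as np

rng = np.random.default_rng(4)
a, N, trials = 3, 10, 100_000
hits_N, lengths = 0, []

for _ in range(trials):
    s, t = a, 0
    while 0 < s < N:                          # play until ruin or fortune N
        s += 1 if rng.random() < 0.5 else -1
        t += 1
    hits_N += (s == N)
    lengths.append(t)

print("P(S_T = N):", hits_N / trials, "  theory a/N:", a / N)
print("E[T]:", np.mean(lengths), "  theory a(N-a):", a * (N - a))
```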
Example: Random harmonic series It is well-known that the harmonic series 1 + \frac{1}{2} + \frac{1}{3} + \cdots diverges while the alternating harmonic series 1 - \frac12 + \frac13 - \cdots converges. What happens if we choose the signs in the series randomly? To see this let X_i be IID random variables with P(X_i=-1)=P(X_i=1)=\frac{1}{2}. Let M_0=0 and for n \ge 1 define M_n = \sum_{j=1}^n \frac{1}{j}X_j. Then M_n is a martingale with E[M_n]=0 and E[M_n^2] = \sum_{j=1}^n \frac{1}{j^2} < \infty; since M_n is bounded in L^2, the martingale convergence theorem implies that the randomly signed harmonic series converges almost surely.
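A few sample paths of the partial sums (an illustrative sketch, not from the notes; numpy assumed) suggest the convergence:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 10_000, 5
signs = rng.choice([-1.0, 1.0], size=(trials, n))
M = np.cumsum(signs / np.arange(1, n + 1), axis=1)   # M_n along each sample path

# each row settles down to a (random) limit; Var(M_n) stays below pi^2/6
print(M[:, [99, 999, 9999]])
print("bound on Var(M_n):", np.pi ** 2 / 6)
```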
We can use martingales, as for the gambler’s ruin, to analyze much more complicated models. The voter model is a simple opinion dynamics model. Here are the ingredients.
A graph G=(V,E) is given. At each vertex of the graph is an agent who can be either in state 0 or in state 1. We view this as two different possible opinions for the agent. We describe the state of the system by a vector {\bf \sigma} = \left( \sigma(v)\right)_{v \in V} \quad \text{ with } \sigma(v) \in \{0,1\} and the state space is S=\{0,1\}^{|V|}.
We think of G as a weighted directed graph. To every directed edge we associate a weight c(v,w)>0 and define c(v)= \sum_{w} c(v,w). We do not need to assume that c(v, w)= c(w,v). But we assume that the graph is connected: there is a path along directed edges between any pair of vertices. To these weights we can associate transition probabilities
p(v,w)=\frac{c(v,w)}{c(v)}
Let Y_n be the Markov chain on the state space V with transition probabilities p(v,w). It is irreducible and has a stationary distribution \pi(v). If c(v,w)=c(w,v) then we know that the stationary distribution is \pi(v) = \frac{c(v)}{c_G} where c_G =\sum_{v} c(v). In general \pi might be difficult to compute explicitly.
In the voter model, at each unit of time you pick a voter uniformly at random, say the voter at vertex v. The voter picks one of its neighbors w with probability p(v,w), and if the neighbor at w has a different opinion, the voter at v adopts the opinion of w. This is admittedly a pretty simplistic model but let us analyze it nonetheless. Let us denote by X_n the corresponding Markov chain on S. The transition probabilities are given by \begin{aligned} P( {\bf \sigma}, {\bf \sigma} + {\bf e}_v ) &= \frac{1}{|V|} 1_{\{\sigma(v)=0\}} \sum_{w : \sigma(w)=1} p(v,w) \\ P( {\bf \sigma}, {\bf \sigma} - {\bf e}_v ) & = \frac{1}{|V|} 1_{\{\sigma(v)=1\}} \sum_{w : \sigma(w)=0} p(v,w) \end{aligned} where {\bf e}_v is the state with {\bf e}_v(v)=1 and {\bf e}_v(w)=0 if w\not= v.
The key insight is the following
Theorem 3.3 For the voter model M_n = \pi X_n = \sum_{v} \pi(v) X_n(v) is a martingale.
Proof. By the Markov property E[M_{n+1}-M_{n}| X_0, \cdots, X_n ] = E[M_{n+1}-M_{n}| X_n ] so it is enough to show that E[M_{n+1}-M_{n}| X_n =\sigma] =0.
To see this we note that X_{n+1}=X_n \pm {\bf e}_v \implies M_{n+1} - M_n = \pm \pi(v) Therefore E[M_{n+1}-M_{n}| X_n =\sigma] = \frac{1}{|V|}\sum_v \sum_{w} \pi(v) p(v,w) \left(1_{\{\sigma(v)=0\}} 1_{\{\sigma(w)=1\}} - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} \right) Now we rewrite \begin{aligned} 1_{\{\sigma(v)=0\}} 1_{\{\sigma(w)=1\}} - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} & = 1_{\{\sigma(v)=0\}} (1- 1_{\{\sigma(w)=0\}}) - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} \\ &= 1_{\{\sigma(v)=0\}} - 1_{\{\sigma(w)=0\}} \end{aligned} and thus E[M_{n+1}-M_{n}| X_n =\sigma] = \frac{1}{|V|} \left( \sum_{v: \sigma(v)=0} \sum_{w} \pi(v) p(v, w) - \sum_{w:\sigma(w)=0} \sum_{v} \pi(v)p(v,w) \right)
Using the fact that \pi is the stationary distribution we have the balance equation
\sum_{v} \pi(v) p(v,w) = \sum_{v} \pi(w) p(w,v)
Using this and then exchanging the indices v and w we find
\sum_{w:\sigma(w)=0} \sum_{v} \pi(v)p(v,w) = \sum_{w:\sigma(w)=0} \sum_{v} \pi(w)p(w,v) = \sum_{v:\sigma(v)=0} \sum_{w} \pi(v)p(v,w)
which implies that
E[M_{n+1}-M_{n}| X_n =\sigma] =0
and therefore M_n is a martingale. \quad \blacksquare
We can now apply Theorem 3.2 where we take the stopping time to be the consensus time at which everyone is in agreement: T= \inf\{ n : X_n(v)=0 \text{ for all } v \text{ or } X_n(v)=1 \text{ for all } v \} Since the Markov chain has finite state space P(T < \infty)=1 and M_n is bounded (in fact 0 \le M_n\le 1). We have, as for the gambler’s ruin, P(X_T =(1,\cdots, 1))=E[M_T]= E[M_0]= \pi X_0
Example If the underlying weights are those of the simple random walk on the graph, c(v,w)=c(w,v)=1, then \pi(v) = \frac{\deg(v)}{2|E|} and we can easily compute
\pi X_0. For example if we consider the complete graph on N vertices, or any regular graph in which all vertices have the same degree, then P(X_T =(1,\cdots, 1)) is simply the initial proportion of vertices with opinion 1.
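The sketch below simulates the voter model on a small, arbitrarily chosen weighted directed graph (the weight matrix c and the initial opinions are made up for illustration; numpy is assumed) and compares the empirical consensus probability with \pi X_0:

```python
import numpy as np

rng = np.random.default_rng(6)

# weights c(v, w) on a directed graph with 4 vertices (arbitrary choice)
c = np.array([[0.0, 2.0, 1.0, 0.5],
              [1.0, 0.0, 1.0, 1.0],
              [0.5, 2.0, 0.0, 1.0],
              [1.0, 1.0, 2.0, 0.0]])
p = c / c.sum(axis=1, keepdims=True)           # transition probabilities p(v, w)

# stationary distribution pi of p (left eigenvector for the eigenvalue 1)
vals, vecs = np.linalg.eig(p.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

sigma0 = np.array([1, 0, 0, 1])                # initial opinions
target = pi @ sigma0                           # predicted P(consensus on opinion 1)

trials, consensus_one = 20_000, 0
for _ in range(trials):
    sigma = sigma0.copy()
    while 0 < sigma.sum() < len(sigma):        # run until consensus
        v = rng.integers(len(sigma))           # pick a voter uniformly
        w = rng.choice(len(sigma), p=p[v])     # it consults a random neighbor
        sigma[v] = sigma[w]                    # and adopts that opinion
    consensus_one += int(sigma.sum() == len(sigma))

print("P(consensus on 1):", consensus_one / trials, "  theory pi X_0:", target)
```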
Exercise 3.1 Here is another version of the optional sampling theorem. Suppose that M_n is a martingale. The increments B_n of the martingale are given by B_n= M_{n}-M_{n-1}.
Show that if the stopping time T has finite expectation E[T]<\infty and if the martingale M_n has uniformly bounded increments, i.e.
\sup_{n} E[|B_n|] \le C
then E[M_T]=E[M_0].
Hint: Bound E[|M_{T_n}-M_0|] and then use the dominated convergence theorem to take n\to \infty.
Exercise 3.2 (Wald’s identity) As we have seen before, if N is an integer valued RV independent of the X_i and (X_i) are IID copies of a random variable X, then E[\sum_{k=1}^N X_k]= E[X]E[N].
Prove the following generalization: If T is a stopping time for the X_i with E[T] finite then E[\sum_{k=1}^T X_k]= E[X]E[T]. Hint: Consider a suitable martingale and use Exercise 3.1.
Exercise 3.3 (Gambler’s ruin probabilities and expected playing time) Suppose P(X_i=1)=p, P(X_i=-1)=q=1-p with p \not= \frac{1}{2}, and let S_n=a + X_1 + \cdots +X_n. And let T = \min \{n :S_n =0 \text{ or } S_n=N \}
Show that M_n= \left(\frac{q}{p}\right)^{S_n} is a martingale with respect to X_1, X_2, \cdots.
Use part 1. and the optional sampling theorem to compute P(S_T=0).
Use the martingale Z_n = S_n + (1-2p)n and part 2. to compute E[T].
If \phi: \mathbb{R} \to \mathbb{R} is a convex function then the graph of \phi lies above any tangent line to the graph, and so for any point x_0 we can find a number a such that \phi(x) \ge \phi(x_0)+ a (x-x_0) If \phi is differentiable at x_0 then a=\phi'(x_0) while in general we may have more than one choice for a. This simple geometric observation lies behind Jensen’s inequality, a simple yet very important inequality.
Theorem 4.1 (Conditional Jensen inequality) Suppose X is a random variable with E[|X|] <\infty and \phi: \mathbb{R} \to \mathbb{R} a convex function. Then we have \phi(E[ X|Y]) \le E[ \phi(X)|Y] \quad \text{conditional Jensen inequality} In particular we have \phi(E[X]) \le E[ \phi(X)] (Jensen inequality).
Proof. Choosing x_0= E[X|Y] (note that the corresponding number a is then a function of E[X|Y] and hence measurable with respect to Y) we have \phi(X) \ge \phi(E[X|Y]) + a (X-E[X|Y]) Taking the conditional expectation with respect to Y gives the result: E[\phi(X)|Y] \ge E[\phi(E[X|Y])|Y] + a E[(X-E[X|Y])|Y]= \phi(E[X|Y]) \,. \quad \blacksquare
Recall that if X is a normal random variable then its moment generating function is given by E[e^{tX}]= e^{\mu t + \frac{\sigma^2 t^2}{2}} which implies, by the Chernoff bound, the concentration bound P( X - E[X] \ge \epsilon) \le \inf_{t \ge 0} \frac{E\left[e^{t(X- E[X])}\right]}{e^{t\epsilon}} = e^{- \sup_{t \ge 0} \left\{ t\epsilon - \frac{\sigma^2 t^2}{2} \right\} } = e^{-\frac{\epsilon^2}{2\sigma^2}} Note that we also have a similar bound for the left tail, P(X-E[X]\le - \epsilon) \le e^{-\frac{\epsilon^2}{2\sigma^2}}, proved in the same way.
It turns out that bounded random variables also satisfy such Gaussian concentration bounds. To see this we need the following result, which expresses the fact that among all random variables supported on the interval [-A,B] with mean 0, the one which is most spread out is the random variable concentrated on the endpoints -A and B.
Theorem 4.2 (Hoeffding’s bound in conditional form) Suppose X is a random variable such that E[X|Y]=0 and -A \le X\le B. Then for any convex function \phi we have E[\phi(X)|Y]\le \phi(-A) \frac{B}{A+B} + \phi(B) \frac{A}{A+B} In particular if A=B, we have E[ e^{tX}|Y] \le \cosh(At) \le e^{\frac{A^2 t^2}{2}}
Proof. If x \in [-A, B] let us write x as a convex combination of -A and B, that is x= \frac{B-x}{A+B}(-A) + \frac{A+x}{A+B} B \,. The convexity of \phi implies that \phi(x) \le \frac{B-x}{A+B}\phi(-A) + \frac{A+x}{A+B} \phi(B) \,. Applying this with x=X and taking the conditional expectation with respect to Y, using that E[X|Y]=0, gives the result.
Taking now \phi(x)=e^{tx} and A=B we find E[ e^{tX}|Y] \le \frac{1}{2}(e^{-At} + e^{At}) = \cosh(At) = \sum_{n=0}^\infty \frac{A^{2n} t^{2n}}{(2n)!} \le \sum_{n=0}^\infty \frac{A^{2n} t^{2n}}{2^n n!} =e^{\frac{t^2 A^2}{2}}
The next theorem provides a concentration inequality for martingales with bounded increments.
Theorem 4.3 (Azuma-Hoeffding’s theorem) Suppose M_n is a martingale with M_0=0 and with bounded increments, i.e. B_n=M_n-M_{n-1} satisfies the bound |B_n| \le \sigma_n \,. Then we have the Gaussian concentration bound P(M_n \ge \epsilon ) \le e^{-\frac{\epsilon^2}{2 \sum_{k=1}^n \sigma_k^2}}
Proof. Using the martingale property and Theorem 4.2, \begin{aligned} E[e^{tM_n}]= E[E[e^{tM_n}|\mathcal{F}_{n-1}]] = E[E[e^{tM_{n-1}} e^{tB_n}|\mathcal{F}_{n-1}]] = E[e^{tM_{n-1}} E[e^{tB_n}|\mathcal{F}_{n-1}] ] \le e^{\frac{t^2 \sigma_n^2}{2}} E[e^{tM_{n-1}}] \end{aligned} Iterating we find E[e^{tM_n}] \le e^{\frac{t^2 \sum_{k=1}^n\sigma_k^2}{2}} and the Chernoff bound gives the result.
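A quick numerical check of the bound (a sketch, not part of the notes; the walk length, number of samples, and \epsilon are arbitrary) for the simple \pm 1 random walk, where \sigma_k=1:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, eps = 200, 200_000, 20.0
increments = rng.choice([-1.0, 1.0], size=(trials, n))   # |B_k| <= sigma_k = 1
M_n = increments.sum(axis=1)

empirical = (M_n >= eps).mean()
bound = np.exp(-eps ** 2 / (2 * n))                      # Azuma-Hoeffding bound
print("P(M_n >= eps):", empirical, " <= bound:", bound)
```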
McDiarmid’s theorem is an application of the Azuma-Hoeffding theorem and provides concentration bounds for (nonlinear) functions of independent random variables X_1, \cdots, X_n, under certain conditions.
Definition 4.1 We say that h(x_1, x_2, \cdots, x_n) satisfies the bounded difference property if there exist constants c_k such that for all x_k, x_k' |h(x_1,\cdots, x_k, \cdots, x_n) - h(x_1,\cdots, x_k', \cdots, x_n)| \le c_k that is we control the change of h when changing only one coordinate at a time.
Theorem 4.4 (McDiarmid Theorem) Suppose X_1, \cdots X_n are independent RVs and h(x_1, \cdots, x_n) satisfies the bounded difference property (almost surely). Then we have \begin{aligned} P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \ge \varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \\ \end{aligned}
Proof. “Martingale trick”: We construct a suitable martingale. Let us define random variables Y_k by Y_0=E[h(X_1, \cdots, X_n)] and, for 1\le k \le n, Y_k =E[h(X_1,\cdots, X_n)|X_1, \cdots, X_k] Note that Y_n=h(X_1,\cdots, X_n) and by construction E[Y_k| X_1, \cdots, X_{k-1}]= Y_{k-1}, that is (Y_k) is a martingale and so is M_k=Y_k-Y_0. To use Theorem 4.3 we need to prove that the increment Y_k-Y_{k-1} = E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k]- E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}] is bounded.
Duplication trick: Let \widehat{X}_k be an independent copy of the random variable X_k. Then by linearity of the expectation and the bounded difference property we have, almost surely,
\left| E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k] - E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]\right| \le c_k
\tag{4.1} Furthermore we have
\begin{aligned}
E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]\\
&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]
\end{aligned}
\tag{4.2} The first equality holds because X_k and \widehat{X}_{k} are identically distributed and the left-hand side is a function of X_1, \cdots, X_{k-1} only. The second equality holds because X_k and \widehat{X}_{k} are independent. Combining Equation 4.1 and Equation 4.2 shows that |Y_k-Y_{k-1}| \le c_k almost surely. \quad \blacksquare
Example: Suppose S_n = X_1 + \cdots + X_n is a sum of IID random variables such that E[X_i]=\mu and a \le X_i \le b. Then the function h(x_1, \cdots, x_n)=x_1+\cdots+x_n satisfies the bounded difference property with c_k=(b-a) and we obtain the classical Hoeffding theorem for bounded random variables \begin{aligned} P \left( \frac{X_1 + \cdots + X_n}{n} - \mu \ge \varepsilon \right) &\le e^{- \frac{2 n \varepsilon^2}{(b-a)^2}}\ \end{aligned}
Example: Suppose we are interested in estimating the variance. We use the unbiased variance estimator V_N(X_1, \cdots, X_N)=\frac{1}{N-1} \sum_{i=1}^N\left(X_i - \frac{S_N}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^N X_i^2 - \frac{S_N^2}{(N-1)N} If we change X_1 into \widehat{X}_1 then S_N(X_1, \cdots, X_N ) - S_N(\widehat{X}_1, \cdots, X_N ) = X_1 - \widehat{X}_1 and so V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N) = \frac{X_1^2 - \widehat{X}_1^2}{N-1} - \frac{X_1 - \widehat{X}_1}{N-1} \left(\frac{S_N(X_1, \cdots, X_N )}{N} + \frac{S_N(\widehat{X}_1, \cdots, X_N )}{N} \right)
Let us assume a \le X_i \le b. Since we can replace X_i by X_i -\frac{a+b}{2} without changing V_N, we can instead assume that -\frac{(b-a)}{2} \le X_i \le \frac{(b-a)}{2} and then find the bounded difference bound |V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N)| \le \frac{\frac{5}{4}(b-a)^2}{N-1} = c_k and thus by McDiarmid’s theorem, writing c=b-a, we get P( V_N - \sigma^2 \ge \varepsilon ) \le e^{- \frac{\varepsilon^2}{2 \sum_{k=1}^N c_k^2}} \le e^{ - \frac{(N-1)^2}{N} \frac{4 \varepsilon^2}{50 c^4}} \,\le\, e^{ -(N-2)\frac{4\varepsilon^2}{50c^4}} This decays exponentially in N and can be used for a confidence interval for the variance, very similar to the one for the mean.
Another classical example in probability is the so-called “balls and bins problem”: suppose we have m balls that are thrown into n bins, the location of each ball being chosen at random independently of the other balls. It turns out this problem occurs in many algorithmic optimization problems. We can ask many questions related to this problem. For example, what is the probability that some bin has at least two balls in it (this is a version of the famous birthday problem!), or what is the maximal number of balls in a bin, or the number of empty bins, and so on…
Let us compute first the expected number of empty bins. We write N=X_1 + \cdots + X_n where X_i=1 if the i^{th} bin is empty and 0 otherwise. For the first bin to be empty, it must be missed by all balls, which occurs with probability \left(1-\frac{1}{n}\right)^m and thus E[N]= n\left(1-\frac{1}{n}\right)^m We prove a concentration bound around the mean using a martingale argument. We consider the sequence of random variables Y_1, \cdots, Y_m, where Y_i is the bin in which the i^{th} ball falls, and write N=N(Y_1, \cdots, Y_m). To apply Theorem 4.4 we check the bounded difference property. Consider how N changes when we change the location Y_i of the i^{th} ball: the bin it leaves gains at most one empty bin (if that ball was alone there) and the bin it enters loses at most one empty bin, so N changes by at most 1. Then, by Theorem 4.4, we have P(|N - E[N]|\ge \epsilon)\le 2 e^{-\frac{2\epsilon^2}{m}}
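The sketch below (not from the notes; m, n and \epsilon are arbitrary) estimates the number of empty bins and compares the empirical tail with the bound above:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, trials, eps = 100, 100, 20_000, 10.0     # m balls, n bins
bins = rng.integers(0, n, size=(trials, m))
empty = np.array([n - np.unique(row).size for row in bins])

mean_theory = n * (1 - 1 / n) ** m
print("E[N]:", empty.mean(), "  theory:", mean_theory)
print("P(|N - E[N]| >= eps):", (np.abs(empty - mean_theory) >= eps).mean(),
      " <= bound:", 2 * np.exp(-2 * eps ** 2 / m))
```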
As another application of the Azuma-Hoeffding theorem we consider empirical processes, which are given by Z(X_1, \cdots, X_n)=\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - E[f(X_i)] \right| where the X_i are IID random variables and f belongs to a class of functions \mathcal{F}. We assume here that \sup_{x}|f(x)| \le B \text{ for all } f \in \mathcal{F} i.e. the functions f are uniformly bounded. One of the simplest examples of an empirical process is obtained by considering the functions f_t(x)=1_{\{x \le t\}}, t \in \mathbb{R}. By the LLN we have \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} \to P(X \le t)=F_X(t) \quad \text{ almost surely} where F_X is the distribution function of the RVs X_i. By the Glivenko–Cantelli theorem we have \sup_{t \in \mathbb{R} }\left| \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} - F_X(t)\right| \to 0 \quad \text{ almost surely} that is we have uniform convergence of the empirical distribution to the true distribution.
Empirical processes satisfy the bounded difference property: indeed if we set g(x_1, \cdots, x_n) = \left| \frac{1}{n}\sum_{i=1}^n f(x_i) - E[f(X)] \right| then g(x_1, \cdots, \tilde{x}_k, \cdots, x_n) = \left| \frac{1}{n}\sum_{i=1}^n f(x_i) - E[f(X)] + \frac{1}{n} (f(\tilde{x}_k) - f(x_k)) \right| \le g(x_1, \cdots, {x}_k, \cdots, x_n) + \frac{2B}{n} from which we conclude that Z(x_1, \cdots, \tilde{x}_k, \cdots, x_n) \le Z(x_1, \cdots, {x}_k, \cdots, x_n) +\frac{2B}{n}. Interchanging x_k and {\tilde x}_k proves the bounded difference property with c_k=\frac{2B}{n}. Then Theorem 4.4 shows that, for every \epsilon>0, P(|Z - E[Z]|> \epsilon) \le 2 e^{-\frac{n \epsilon^2}{2 B^2}} This tells us that the behavior of Z is controlled by its mean E[Z]. For example if we set \delta=e^{-\frac{n \epsilon^2}{2 B^2}} we have |Z-E[Z]| \le B \sqrt{\frac{2}{n}\log \frac{1}{\delta}} \quad \text{ with probability at least } 1- 2 \delta for any \delta >0.
To go further we need to control the mean E[Z] and to do this we need two results.
Theorem 4.5 Suppose Y_1, Y_2, \cdots, Y_N are RVs (no independence needed) with E[e^{tY_i}]\le e^{\frac{t^2\sigma^2}{2}} for i=1,\cdots, N. Then E[ \max_{i} Y_i ] \le \sigma \sqrt{2 \ln(N)}
Proof. By Jensen inequality we have e^{t E[ \max_{i} Y_i ]} \le E[ e^{t \max_{i} Y_i } ] \le \sum_i E[e^{t Y_i}]\le N e^{\frac{t^2\sigma^2}{2}} Thus E[ \max_{i} Y_i ] \le \frac{\log{N}}{t} + \frac{t\sigma^2}{2} and optimizing over t yields the result. \quad \blacksquare.
Note that if the Y_i satisfy the condition of the theorem so do the -Y_i, and thus we also have E[ \max_{i} |Y_i| ] = E[ \max_{i} \max\{ Y_i, -Y_i\} ] \le \sigma \sqrt{ 2 \log{(2 N)}} \le 2 \sigma \sqrt{ \log{N}}
Theorem 4.6 Suppose \epsilon_i are IID Rademacher random variables (i.e. equal to \pm 1 with probability \frac{1}{2}), independent of the X_i. Then we have E\left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(X)]\right|\right] \le 2 E_{X} \left[ E_{\epsilon} \left[ \sup_{f}\left|\frac{1}{n} \sum_{i=1}^n \epsilon_{i} f(X_i)\right| \right]\right]
Proof. We use a duplication trick and consider RVs Y_1, \cdots, Y_n which are independent copies of X_1, \cdots, X_n. We have then
\begin{aligned}
E\left[ \sup_{f}| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(X_i)]|\right] &= E\left[ \sup_{f}| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(Y_i)]|\right] \\
& = E_{X} \left[ \sup_{f}\left| E_{Y} [\frac{1}{n} \sum_{i=1}^nf(X_i)-f(Y_i)] \right| \right] \\
& \le E_{X} E_{Y} \left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^nf(X_i)-f(Y_i) \right|\right] \\
& \le E_{X} E_{Y} \left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^n \epsilon_{i} (f(X_i)-f(Y_i)) \right|\right]
\end{aligned}
where in the last line we have used that multiplying (f(X_i)-f(Y_i)) by a factor -1 is equivalent to exchanging X_i and Y_i and thus does not change the expectation. Taking now expectation over \epsilon_i and using the triangle inequality yields the result.
This is very useful since we have gotten rid of E[f(X_i)], which may be unknown, and we now have a quantity which depends only on the data X_1, \cdots, X_n. How useful this can be is demonstrated next in the context of the Glivenko–Cantelli theorem.
Theorem 4.7 We have P\left( \sup_{t \in\mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} - F_X(t) \right| \ge 8 \sqrt{ \frac{\log(n+1)}{n}} + \varepsilon \right) \le e^{-\frac{n \varepsilon^2}{2}}
Proof. Using Theorem 4.6 we bound E_{\epsilon}\left[ \sup_{t \in\mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_{i}1_{\{x_i\le t\}}\right|\right] for some fixed values of the x_i. Given those values, and since f_t(x)=1_{\{x\le t\}} is the indicator function of a half line, as t varies there are only n+1 possible values for the vector (f_t(x_1), \cdots, f_t(x_n)) (to see this, order the x_i in increasing order). Therefore the supremum over t reduces to a supremum over n+1 values (which depend on x_1, \cdots, x_n).
Since \epsilon_i satisfies the Gaussian bound E[e^{s \epsilon_i}]\le e^{\frac{s^2}{2}} we have E_\epsilon[e^{ s\frac{1}{n} \sum_{k=1}^n \epsilon_{k} 1_{\{x_k\le t\}}}]= \prod_{k=1}^n E_\epsilon[e^{\frac{s}{n} \epsilon_k 1_{\{x_k\le t\}}} ] \le \prod_{k=1}^n e^{\frac{s^2}{2n^2}} = e^{\frac{s^2}{2n}} a bound which is independent of the x_i’s!
and thus by Theorem 4.5 (and the remark following it) we find E_{\epsilon}\left[ \sup_{t \in\mathbb{R}} \left| \frac{1}{n}\sum_{k=1}^n \epsilon_{k}1_{\{x_k\le t\}}\right|\right] \le \sqrt{4 \ln(n+1)} \frac{1}{\sqrt{n}} Combining this with Theorem 4.6 gives E[Z] \le 8 \sqrt{\frac{\log(n+1)}{n}}, and together with the concentration bound for Z around its mean obtained above this yields the theorem. \quad \blacksquare
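For Uniform(0,1) samples the supremum over t is attained at the data points, which makes \sup_t |F_n(t)-F_X(t)| easy to compute. The sketch below (not from the notes; n and the number of trials are arbitrary) compares its typical size with the term 8\sqrt{\log(n+1)/n} appearing in Theorem 4.7:

```python
import numpy as np

rng = np.random.default_rng(9)
n, trials = 2_000, 2_000
sup_dev = np.empty(trials)
for t in range(trials):
    x = np.sort(rng.random(n))               # Uniform(0,1) samples, F_X(t) = t
    i = np.arange(1, n + 1)
    # sup_t |F_n(t) - t| for sorted samples
    sup_dev[t] = max(np.max(i / n - x), np.max(x - (i - 1) / n))

print("typical sup_t |F_n - F_X|:", sup_dev.mean())
print("term 8 sqrt(log(n+1)/n) in Theorem 4.7:", 8 * np.sqrt(np.log(n + 1) / n))
```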
Consider a Markov chain X_j with state space S and transition probabilities P and consider a bounded function f: S \to \mathbb{R} .
We want to construct a martingale associated to the sequence of random variables f(X_j), j=0,1,2,\cdots.
If the f(X_j) were independent it would be enough to assume that they have mean 0.
In general we use the following idea. If we have a sequence of integrable RVs Y_0, Y_1, Y_2, \cdots then we set
D_m= Y_m - E[Y_m|Y_0, \cdots, Y_{m-1}]
and we have then E[D_m|Y_0, \cdots, Y_{m-1}]=0. So we can build a martingale by
M_n = \sum_{j=1}^n D_j = \sum_{j=1}^n (Y_j - E[Y_j|Y_0, \cdots, Y_{j-1}]) \,.
The same idea works for Markov chains: if f is bounded then, using the Markov property, D_m = f(X_m)- E[f(X_m)|X_0, \cdots, X_{m-1}] = f(X_m)- E[f(X_m)| X_{m-1}] = f(X_m) -Pf(X_{m-1}) is a martingale increment with respect to the sequence X_0, X_1, X_2, \cdots.
We obtain therefore the martingale \begin{aligned} M_n = \sum_{j=1}^n \left(f(X_j) -Pf(X_{j-1})\right) & = f(X_n) -f(X_0) + \sum_{j=0}^{n-1} (f(X_j)- Pf(X_j)) \\ & = f(X_n) -f(X_0) - \sum_{j=0}^{n-1} Af(X_j) \end{aligned} \tag{5.1} for f bounded and where we used the notation A=P - I. This martingale is called Dynkin’s martingale.
If we want to apply martingale theory to analyze the sum \sum_{j=1}^n g(X_j) then we would like to use the martingale Equation 5.1. To do this we must find a function f such that Af = (P -I)f = - g Then we would have a martingale M_n = f(X_n) - f(X_0) + \sum_{j=0}^{n-1} g(X_j) \quad \quad \text{ with } Af = -g which, up to the correction term f(X_n) - f(X_0), is equal to \sum_{j=0}^{n-1} g(X_j).
What we have uncovered in building a martingale from a Markov chain is an important equation.
Definition 5.1 (Poisson equation) Let X_n be a Markov chain with transition matrix P. A function f satisfies the Poisson equation for the function g if Af = -g \quad \quad \text{where } \quad A =P - I
Note that the Poisson equation need not have a solution for arbitrary g. We investigate this in the context of finite state space Markov chains.
Theorem 5.1 Suppose X_n is an irreducible Markov chain with transition matrix P and stationary distribution \pi. Then the Poisson equation Af =-g has a solution if and only if g has mean 0 with respect to the stationary distribution, that is \pi g = \sum_{i} \pi(i) g(i) =0.
A solution f is then given by f= (\Pi -(P-I))^{-1}g where \Pi is the matrix each of whose rows is the stationary distribution \pi.
Proof. By irreducibility the matrix P has a unique right eigenvector h=(1, \cdots, 1)^T for the eigenvalue 1 (up to a multiplicative constant) and a unique left eigenvector \pi. The algebraic multiplicity of the eigenvalue 1 is also equal to 1 since (P-I)^2 g=0 implies (P-I)g=ch, i.e. Pg = g +ch, which in turn implies that \pi g= \pi Pg = \pi g + c, which is impossible unless c=0.
The matrix \Pi is a projection, \Pi^2=\Pi, and we have P \Pi = \Pi P = \Pi. That is \Pi projects onto the eigenspace corresponding to the eigenvalue 1 and we have \Pi f = (\pi f) h. So if \pi f=0 then (P - \Pi)f= Pf. As a consequence (P-\Pi)=P(I -\Pi) has the same eigenvalues as P except that the eigenvalue 1 is replaced by the eigenvalue 0, and thus I-(P-\Pi) = \Pi - (P-I) is invertible.
If f solves the Poisson equation then (P-I)f = -g and thus \pi P f - \pi f = \pi f -\pi f = 0 = -\pi g which implies that \pi g=0.
Conversely, given g with \pi g =0, let us take f= (\Pi -(P-I))^{-1}g. This implies that \Pi f -(P-I)f = g, which proves the claim if we can show that \Pi f =0. But since \Pi commutes with P we have \Pi f = \Pi (\Pi -(P-I))^{-1}g = (\Pi -(P-I))^{-1} \Pi g = 0 since \pi g=0. \quad \blacksquare
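Numerically, the formula f=(\Pi-(P-I))^{-1}g is straightforward to check. The sketch below (not from the notes; the transition matrix P and the function g are made up for illustration) verifies that (P-I)f=-g once g is centered so that \pi g=0:

```python
import numpy as np

# a small irreducible transition matrix P (arbitrary choice)
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# stationary distribution pi (left eigenvector of P for the eigenvalue 1)
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

g = np.array([1.0, -2.0, 0.5])
g = g - pi @ g                             # center g so that pi g = 0
Pi = np.tile(pi, (3, 1))                   # matrix whose rows are all equal to pi

f = np.linalg.solve(Pi - (P - np.eye(3)), g)        # f = (Pi - (P - I))^{-1} g
print("(P - I) f + g =", (P - np.eye(3)) @ f + g)   # should be (numerically) zero
```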
We can now use this to prove a central limit theorem for Markov chains.
Theorem 5.2 (Central limit theorem for Markov chains) For h: S \to \mathbb{R} let {\overline h} be defined by \overline{h}(i)=h(i) - \pi h, so that \pi \overline{h} =0. Then for any initial distribution of X_0, as n \to \infty, \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} {\overline h}(X_j) converges in distribution to a mean zero normal random variable with variance \sigma^2 = \pi (f^2) - \pi((Pf)^2) where f is the solution of the Poisson equation (P-I)f = - {\overline h}.
Proof. Consider the martingale M_n = f(X_n) - f(X_0) + \sum_{j=0}^{n-1} {\overline h}(X_j) whose increments are D_j= f(X_j) - Pf(X_{j-1}).
We claim that E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] =1 \tag{5.2} Indeed, using the properties of conditional expectations we have \begin{aligned} & E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E \left[ E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \Big| \mathcal{F}_{n-1} \right]\right] = E \left[ \frac{E[e^{i \theta M_n}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] \\ & = E \left[ \frac{E[e^{i \theta (M_{n-1}+D_n)}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E \left[ \frac{ e^{i \theta M_{n-1}} E[e^{i \theta D_{n}}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E\left[ \frac{e^{i \theta M_{n-1}}}{\prod_{j=1}^{n-1} E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] \\ \end{aligned} Iterating proves the statement. We now use a Taylor expansion to find \log E[e^{i\frac{\theta}{\sqrt{n}}D_j}|\mathcal{F}_{j-1}] = - \frac{\theta^2}{2n}E[D_j^2|\mathcal{F}_{j-1}] + \frac{1}{n^{3/2}} R_{j,n} where the R_{j,n} are bounded random variables (so that the sum of the error terms over j=1,\cdots,n vanishes as n \to \infty). Note that E[D_j^2|\mathcal{F}_{j-1}] = E[(f(X_j) - Pf(X_{j-1}))^2 | X_{j-1} ] = Pf^2(X_{j-1}) - (Pf(X_{j-1}))^2
Using the strong law of large numbers for Markov chains,
\lim_{n \to \infty }\sum_{j=1}^n \log E[e^{i\frac{\theta}{\sqrt{n}}D_j}|\mathcal{F}_{j-1}] = - \frac{\theta^2}{2} \pi (Pf^2 - (Pf)^2) \quad \text{ almost surely}
Therefore, from Equation 5.2, we find that
\lim_{n \to \infty} E\left[ e^{i \theta \frac{M_n}{\sqrt{n}}} \right] = e^{-\frac{\theta^2 \sigma^2}{2}}
where \sigma^2 = \pi \left(f^2 - (Pf)^2\right).
Since \frac{1}{\sqrt{n}}M_n and \frac{1}{\sqrt{n}}\sum_{j=0}^{n-1} {\overline h}(X_j) only differ by \frac{1}{\sqrt{n}}(f(X_n)-f(X_0)) we obtain the central limit theorem. \quad \blacksquare.
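As an illustration (a sketch, not from the notes; the chain P and the observable h are the same arbitrary choices as in the Poisson-equation sketch above), one can simulate \frac{1}{\sqrt{n}}\sum_{j=0}^{n-1}\overline{h}(X_j) and compare its empirical variance with \sigma^2=\pi(f^2)-\pi((Pf)^2):

```python
import numpy as np

rng = np.random.default_rng(10)
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

h = np.array([2.0, -1.0, 0.5])
hbar = h - pi @ h                                         # centered observable
f = np.linalg.solve(np.tile(pi, (3, 1)) - (P - np.eye(3)), hbar)
sigma2 = pi @ (f ** 2) - pi @ ((P @ f) ** 2)              # asymptotic variance

n, trials = 1_000, 500
samples = np.empty(trials)
for t in range(trials):
    x, total = 0, 0.0
    for _ in range(n):
        total += hbar[x]
        x = rng.choice(3, p=P[x])                         # one step of the chain
    samples[t] = total / np.sqrt(n)

print("empirical variance:", samples.var(), "  theory sigma^2:", sigma2)
```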