Math 606, Spring 2024
University of Massachusetts Amherst
2025-04-24
Martingales are models of fair games, and to understand them we first need to understand conditional expectations. Conditional expectation is a very useful concept for understanding how information obtained from measurements can be incorporated to make predictions.
Suppose we are given a random variable Y. If we know nothing about the outcome of the experiment generating Y, then our best guess for the value of Y is the expectation E[Y]. At the other extreme, if we measure Y itself, then our prediction is Y itself! Conditional expectation deals with making the best guess for the value of Y when we have some partial information, described by some collection of other random variables X_1, X_2, \cdots, X_n.
Example: discrete RV: Suppose X and Y are discrete random variables with \text{joint pdf} \quad p(x,y)= P(X=x,Y=y)\,, \quad \text{marginals} \quad p(x) =\sum_{y} p(x,y)\,, \quad p(y) =\sum_{x} p(x,y)\,. To define the conditional expectation E[Y|X] we first give the best guess for Y given that we have observed X=x, which is E[Y|X=x] = \sum_{y} y P(Y=y| X=x) = \sum_{y} y\, \frac{P(Y=y, X=x)}{P(X=x)} = \sum_{y} y \,\frac{p(x,y)}{p(x)}\,, which is well defined for those x with p(x)>0.
The quantity E[Y|X=x] is a function of x, and evaluating this function at the random variable X defines the random variable E[Y|X]. For example, if we roll two independent dice, X is the value of the first roll, and Y is the sum of the two rolls,
then we have
p(x,y) = \frac{1}{36}\,, \quad x=1,2,\cdots, 6\,, \quad y= x+1, \cdots, x+6,
and thus
E[Y|X=x] = x + \frac{7}{2}
so that E[Y|X]=X + \frac{7}{2}.
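As a quick sanity check, here is a minimal simulation (our own sketch in Python, not part of the original notes; all variable names are ours) which estimates E[Y|X=x] for the two-dice example and compares it with x + 7/2:

```python
import random

# Estimate E[Y | X = x] for two independent dice: X is the first roll,
# Y is the sum of the two rolls. Theory predicts E[Y | X = x] = x + 7/2.
N = 200_000
sums = {x: 0.0 for x in range(1, 7)}
counts = {x: 0 for x in range(1, 7)}
for _ in range(N):
    x = random.randint(1, 6)        # first roll
    y = x + random.randint(1, 6)    # sum of the two rolls
    sums[x] += y
    counts[x] += 1

for x in range(1, 7):
    print(x, round(sums[x] / counts[x], 3), x + 3.5)
```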
In a similar way we can define E[Y|X_1,\cdots, X_n] for discrete RV with joint pdf p(x_1, \cdots, x_n, y).
Example: continuous RV: In a similar way, if Y, X_1, \cdots, X_n are continuous RVs with a joint pdf f(x_1, \cdots, x_n,y) with marginal f(x_1, \cdots,x_n)=\int f(x_1, \cdots, x_n,y) dy, then the function y \mapsto \frac{f(x_1, \cdots, x_n,y)}{f(x_1, \cdots,x_n)} defines a probability density function provided (x_1, \cdots,x_n) is such that f(x_1, \cdots,x_n)\not=0. Then the expectation E[Y|X_1=x_1, \cdots, X_n=x_n] = \int y \frac{f(x_1, \cdots, x_n,y)}{f(x_1, \cdots,x_n)} \, dy defines a function of (x_1, \cdots,x_n). We leave this function undefined whenever f(x_1, \cdots,x_n)=0 (or set it to 0 if you prefer).
Any function h(x_1,\cdots,x_n) can be used to define a RV h(X_1, \cdots,X_n). Note that, as a RV, this
does not depend on how the function is defined when f(x_1, \cdots,x_n)=0, since such x have probability 0.
We denote the corresponding random variable by E[Y|X_1,\cdots, X_n] and call it the conditional expectation of Y given X_1,\cdots, X_n.
The conditional expectation has the following properties
E[Y|X_1, \cdots, X_n] depends only on X_1, \cdots, X_n, in the sense that it is a function h(X_1,\cdots,X_n). In the language of measure theory, E[Y|X_1,\cdots,X_n] is a measurable function with respect to X_1, \cdots, X_n, or better, with respect to the \sigma-algebra generated by X_1, \cdots, X_n.
Suppose that A is an event which depends only on X_1, \cdots, X_n, for example the rectangle A = \left\{ a_i \le X_i \le b_i \,,\, i=1, \cdots,n \right\} and let 1_A be the corresponding indicator function. Then E[ Y 1_A ] = E[ E[Y|X_1,\cdots, X_n] 1_A ]
To prove the second property note that \begin{aligned} E[ E[Y|X_1,\cdots, X_n] 1_A ] & = \int E[Y|X_1=x_1, \cdots, X_n=x_n] 1_A(x_1, \cdots, x_n) f(x_1, \cdots, x_n) dx_1 \cdots dx_n \\ & = \int_A \left(\int y \frac{f(x_1, \cdots, x_n, y)}{f(x_1, \cdots, x_n)} dy \right) f(x_1, \cdots, x_n) dx_1 \cdots dx_n \\ & = \int_A \int y f(x_1, \cdots, x_n, y) dy \, dx_1 \cdots dx_n \\ &= E[ Y 1_A ] \end{aligned}
What we have demonstrated with examples are instances of a general theorem. We use the notation \mathcal{F}_n to denote all the information contained in the random variables X_1, \cdots, X_n. We say that a random variable Z is measurable with respect to \mathcal{F}_n if Z=h(X_1, \cdots, X_n) can be expressed as a function of (X_1, \cdots, X_n). We say that a set A is measurable with respect to \mathcal{F}_n if the function 1_A is measurable with respect to \mathcal{F}_n. This simply means that A can be specified using the random variables X_1, \cdots, X_n.
Theorem 1.1 Let Y and X_1, \cdots, X_n be random variables on a probability space (\Omega, \mathcal{F}, \mathcal{P}) and assume that E[|Y|]<\infty. Let \mathcal{F}_n be the \sigma-algebra generated by X_1, \cdots, X_n. Then there exists a unique random variable E[Y|X_1, \cdots, X_n] such that
E[Y|X_1, \cdots, X_n] is measurable with respect to \mathcal{F}_n.
For any A measurable with respect to \mathcal{F}_n we have E[ Y 1_A ] = E[ E[Y|X_1, \cdots, X_n] 1_A ] \quad \text{ for all } A \in \mathcal{F}_n \,.
A more geometric way to understand conditional expectation as an (orthogonal) projection is explored in the homework.
From now on we use the abbreviated notation for the conditional expectation E[Y| \mathcal{F}_n] = E[Y|X_1, \cdots, X_n]\,.
Theorem 1.2 The conditional expectation has the following properties
1. Linearity: E[a_1Y_1 + a_2Y_2| \mathcal{F}_n] =a_1E[Y_1| \mathcal{F}_n] + a_2 E[Y_2| \mathcal{F}_n]
2. If Y=g(X_1,\cdots,X_n) then E[Y|\mathcal{F}_n] = Y
3. If Y is independent of X_1,\cdots,X_n then E[Y|\mathcal{F}_n] = E[Y]
4. If m < n then E[ E[Y|\mathcal{F}_n]|\mathcal{F}_m] = E[Y|\mathcal{F}_m].
5. If Z=g(X_1,\cdots,X_n) then E[Y Z|\mathcal{F}_n] = Z E[Y|\mathcal{F}_n].
Proof. The idea is to use the uniqueness statement in Theorem 1.1.
For part 1., E[Y_i| \mathcal{F}_n] are the unique \mathcal{F}_n measurable random variables such that E[Y_i 1_A] = E[ E[Y_i| \mathcal{F}_n] 1_A], and so by linearity a_1 E[ E[Y_1| \mathcal{F}_n] 1_A] + a_2 E[ E[Y_2| \mathcal{F}_n] 1_A] = a_1 E[Y_1 1_A]+ a_2 E[Y_2 1_A] = E[(a_1 Y_1 +a_2Y_2) 1_A] and by uniqueness we must have E[a_1Y_1 + a_2Y_2| \mathcal{F}_n] =a_1E[Y_1| \mathcal{F}_n] + a_2 E[Y_2| \mathcal{F}_n].
For part 2. if Y=g(X_1,\cdots,X_n) then Y itself satisfies the definition.
For part 3. if Y is independent of X_1,\cdots,X_n then by independence and linearity E[Y 1_A] = E[Y] E[1_A] = E[ E[Y] 1_A] which, by uniqueness, proves the statement.
For part 4. note that E[ E[Y|\mathcal{F}_n]|\mathcal{F}_m] and E[Y|\mathcal{F}_m] both depend only on X_1, \cdots X_m. Moreover if A is \mathcal{F}_m measurable and m \le n then it is also \mathcal{F}_n measurable. So we have E[ E[Y|\mathcal{F}_m] 1_A] = E[ Y 1_A ] = E[ E[Y|\mathcal{F}_n] 1_A] = E[ E[E[Y|\mathcal{F}_n]|\mathcal{F}_m] 1_A ] which, by uniqueness, proves the statement.
Finally for part 5., if Z=1_B and B is \mathcal{F}_n measurable then E[ E[Y 1_B|\mathcal{F}_n] 1_A ] =E[ Y 1_B 1_A ]= E[ Y 1_{A\cap B} ] = E[ E[Y |\mathcal{F}_n] 1_B 1_A] which proves the statement. For general Z one uses an approximation argument by simple functions.
Example Suppose X_i are IID random variables with \mu=E[X_i] and let S_n= X_1 + \cdots + X_n. If we take m < n then we have \begin{aligned} E[S_n|\mathcal{F}_m] & = E[ X_1+ \cdots + X_m|\mathcal{F}_m ] + E[ X_{m+1}+ \cdots + X_n|\mathcal{F}_m ] \\ & = X_1+ \cdots + X_m + E[ X_{m+1}+ \cdots + X_n] = S_m + (n-m)\mu \end{aligned} since X_1 + \cdots + X_m is \mathcal{F}_m measurable and X_{m+1} + \cdots + X_n is independent of X_1, X_2, \cdots, X_m (see Theorem 1.2, properties 2 and 3.)
Example Let S_n be as in the previous example, assume that \mu=0, and let \sigma^2=V(X_i) be the variance. If we take m < n we find, using properties 2., 3., and 5. of Theorem 1.2, \begin{aligned} E[S_n^2|\mathcal{F}_m] & = E[ (S_m + (S_n-S_m))^2|\mathcal{F}_m ] \\ &= E[ S_m^2 |\mathcal{F}_m ] +2 E[ (S_n-S_m) S_m | \mathcal{F}_m ] + E[(S_n-S_m)^2|\mathcal{F}_m ] \\ & = S_m^2 + 2 S_m E[ S_n-S_m] + E[(S_n-S_m)^2] \\ & = S_m^2 + V(S_n-S_m) =S_m^2 +(n-m)\sigma^2 \end{aligned}
Example If the X_i are Bernoulli RVs with parameter p and m <n, let us compute E[S_m|S_n]. Note first that P(X_1=1| S_n=k) =\frac{p {n-1 \choose k-1} p^{k-1}(1-p)^{n-k}}{{n\choose k}p^k (1-p)^{n-k}} = \frac{k}{n} \implies E[X_1|S_n] = \frac{S_n}{n} and, since by symmetry E[X_i|S_n]=\frac{S_n}{n} for every i \le n, E[S_m|S_n] = \frac{m}{n} S_n \,.
Exercise 1.1 (Conditional expectation as a projection)
Show the following: if E[|Y|^2]< \infty (which ensures that all expectations exist) then E[Y|\mathcal{F}_n] is the random variable measurable with respect to \mathcal{F}_n which minimizes the mean square error, that is, it solves \min \{ E[(Y - Z)^2] \,:\, Z \,\,\mathcal{F}_n-\text{measurable} \}\,. Hint: Write Z as Z=E[Y|\mathcal{F}_n]+ W and expand the square.
Suppose E[Y^2]< \infty and m \le n. Show that E[ (Y-E[Y|\mathcal{F}_n])^2 ] + E[ (E[Y|\mathcal{F}_n]- E[Y|\mathcal{F}_m])^2 ] = E[ (Y-E[Y|\mathcal{F}_m])^2 ] in particular E[ (Y-E[Y|\mathcal{F}_n])^2 ] is a decreasing function of n.
Suppose E[Y^2]< \infty and define the conditional variance to be the random variable V(Y|\mathcal{F}_n)= E[Y^2|\mathcal{F}_n] - E[Y|\mathcal{F}_n]^2. Show that V(Y)= E[ V(Y|\mathcal{F}_n)] + V (E[Y|\mathcal{F}_n])
Definition 2.1 Consider a collection of random variables X_1, X_2, \cdots. A sequence of random variables M_0, M_1, M_2, \cdots is called a martingale with respect to the filtration \mathcal{F}_n if
1. E[|M_n|]< \infty for all n.
2. M_n is measurable with respect to \mathcal{F}_n.
3. For each m < n we have E[M_n|\mathcal{F}_m]=M_m.
Remark To verify the martingale property 3. in Definition 2.1 it is enough to check that E[M_{n+1}|\mathcal{F}_n]=M_n \text{ for all } n since this property implies, by item 4. in Theorem 1.2, E[M_{n+2}|\mathcal{F}_n]= E[E[M_{n+2}|\mathcal{F}_{n+1}]|\mathcal{F}_n] = E[M_{n+1}|\mathcal{F}_n]=M_n and so on.
Example One of the prototypes of a martingale is the sum of IID random variables. Let
S_0=0\,, \quad S_n=X_1 + \cdots + X_n\,.
As we have seen before, if m< n,
E[S_n|\mathcal{F}_m]= S_m +(n-m)\mu
This implies that M_n= S_n-n\mu is a martingale.
Example Suppose Y is a random variable with E[|Y|]< \infty. Then we can build a martingale with respect to \mathcal{F}_n by setting M_n = E[Y|\mathcal{F}_n] Indeed we have, for m \le n, E[M_n|\mathcal{F}_m] = E[E[Y|\mathcal{F}_n]|\mathcal{F}_m]= E[Y|\mathcal{F}_m]=M_m In that case we say that the martingale is closed by the random variable Y, and we can think of M_n as successively “better” approximations of Y as we incorporate more and more information.
Suppose we are playing a sequence of independent fair games with two outcomes (e.g. betting on the flip of a fair coin). We describe this by RVs X_i such that P(X_i=+1) = P(X_i=-1)= \frac{1}{2} The RV X_i describes the winnings obtained by betting an amount of 1 on the i^{th} game, and the game is fair since E[X_i]=0.
A betting sequence is a sequence of RV B_n such that
B_n is the amount of money bet on the n^{th} game.
B_n is measurable with respect to \mathcal{F_{n-1}}.
E[|B_n|]< \infty.
The second property means that the way you bet on the n^{th} game is guided by the outcomes of the n-1 previous games. No peeking into the future allowed! The winnings after n games are given by W_n = \sum_{k=1}^n B_k X_k \quad \text{with} \quad W_0=0 and we show it is a martingale.
Clearly W_n is \mathcal{F}_n measurable and E[|W_n|]< \infty. Moreover we have \begin{aligned} E[W_{n+1}|\mathcal{F}_n]& = E[B_{n+1} X_{n+1}| \mathcal{F}_{n}] + E[W_n| \mathcal{F}_{n}] = B_{n+1} E[X_{n+1}| \mathcal{F}_{n}] + W_n = W_n \end{aligned} where we have used that B_{n+1} is \mathcal{F}_n measurable and that X_{n+1} is independent of \mathcal{F}_n with E[X_{n+1}]=0.
The martingale property implies that E[W_n] is constant: we have E[W_n] = E[ E[W_n|\mathcal{F}_{n-1}]] =E[W_{n-1}] which means that your expected winnings in a fair game are zero.
But this is not the end of the story. In a betting strategy you make a sequence of bets and decide on a good moment to actually stop betting. Consider for example the following well-known strategy (often called the martingale strategy): double your bet until you win. If you win the first game you stop and take your gain of W_1=1. If you lose the first game you bet 2 on the second game. If you win the second game your winnings are W_2=-1+2 =1 and you then stop. If you lose the first two games you now bet 4 on the third game, and if you win the third game your winnings are W_3=-1-2+4=1, and so on… We have then the transition probabilities P(W_{n+1}=1|W_n=1)=1 and P(W_{n+1}=1|W_n=-(2^n-1))=\frac{1}{2} \quad \quad P(W_{n+1}=-(2^{n+1}-1)|W_n=-(2^n-1))=\frac{1}{2} and this is a martingale. It is true that E[W_n]=0, but when you stop, which happens at a random time T, you always win 1! The time T at which your first win occurs has here a geometric distribution. We will consider stopping times in the next section.
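The following sketch (our own illustration, with the cutoff of 20 games an arbitrary choice) simulates the doubling strategy and exhibits both facts: E[W_n] stays near 0, while the stopped winnings W_T always equal 1.

```python
import random

def play_doubling(n_max):
    """Bet 1, 2, 4, ... on fair coin flips until the first win (or n_max losses).
    Returns the final winnings and the stopped winnings W_T (None if no win)."""
    w, bet = 0, 1
    for _ in range(n_max):
        if random.random() < 0.5:        # we win this game and stop
            return w + bet, w + bet      # W_T = 1 always
        w, bet = w - bet, 2 * bet        # we lose and double the bet
    return w, None                        # lost every game: w = -(2^n_max - 1)

trials, wn, wt = 100_000, [], []
for _ in range(trials):
    final, stopped = play_doubling(20)
    wn.append(final)
    if stopped is not None:
        wt.append(stopped)

print("E[W_20] ~", sum(wn) / trials)     # close to 0: the game is fair
print("W_T values:", set(wt))            # always {1}
```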
Consider an urn with balls of two colors, say red and green. Assume that initially there is one ball of each color in the urn. At each time step, a ball is chosen at random from the urn. If a red ball is chosen, it is returned and in addition another red ball is added to the urn. Similarly, if a green ball is chosen, it is returned together with another green ball.
Let X_n denote the number of red balls in the urn after n draws. Then X_0=1 and X_n is a (time-inhomogeneous) Markov chain with transitions
P(X_{n+1}=k+1| X_n=k) = \frac{k}{n+2} \quad \quad P(X_{n+1}=k| X_n=k) = \frac{n+2 -k}{n+2}
We now define
M_n= \frac{X_n}{n+2} \quad \text{ fraction of red balls after $n$ draws}
Then M_n is a martingale since
E[X_{n+1}|X_n=k]= (k+1)\frac{k}{n+2} + k \frac{n+2 -k}{n+2} = \frac{k}{n+2} + k \implies E[X_{n+1}|X_n] = X_n + \frac{X_n}{n+2}.
Since this is a Markov chain,
E[M_{n+1}|\mathcal{F}_n]=E[M_{n+1}|X_n]= E\left[\frac{X_{n+1}}{n+3}|X_n\right]= \frac{1}{n+3}\left(X_n + \frac{X_n}{n+2}\right) =\frac{X_n}{n+2}=M_n
There is a natural connection between Markov chains and martingales. We explain this here in the context of continuous-time Markov chains, but this is just one example. Consider a function f:S \to \mathbb{R} and let us derive an equation for E[f(X_t)]. Using the Kolmogorov equation we have \frac{d}{dt}E[f(X_t)] = \frac{d}{dt} \sum_{i} f(i) p_t(i) = \sum_{i} f(i) (p_t A)(i) = \sum_{i} f(i) \sum_{j} p_t(j) A(j,i) = \sum_{j} \sum_{i} A(j,i) f(i) p_t(j) and thus \frac{d}{dt}E[f(X_t)] = E[ g(X_t)] \quad \text{ where } g = Af Integrating the previous equation we find E[f(X_t)] - E[f(X_0)] = \int_0^t E[ Af(X_s)] ds
There is a martingale hidden in this equation. Indeed consider the random variables Y_t = f(X_t) - f(X_0) - \int_0^t Af(X_s) \, ds Our previous calculation shows that E[Y_t]=0, moreover by the Markov property we have E[Y_{t+s} -Y_t | \mathcal{F}_t ] = E[ f(X_{t+s}) - f(X_t) - \int_t^{t+s} Af(X_u) du | X_t ] =0
Exercise 2.1
Suppose X_1, X_2, \cdots are IID random variables with E[X_i]=1. Show that M_n = X_1 X_2 \cdots X_n is a martingale.
Suppose the RV X has pdf f(x) and the RV Y has pdf g(x). Show that M_n= \prod_{j=1}^n \frac{g(X_j)}{f(X_j)} is a martingale with respect to X_1, X_2, \cdots, where the X_i are IID with pdf f(x). This martingale is called the likelihood ratio martingale.
Suppose X_1, X_2, \cdots are IID random variables with moment generating function M(t)=E[e^{tX_i}] and let S_n=X_1 + \cdots + X_n. Show that M_n = \frac{e^{tS_n}}{M(t)^n} is a martingale.
Exercise 2.2 Suppose N_t is a Poisson process with rate \lambda. Show that N_t-\lambda t is a martingale.
Exercise 2.3 Consider a branching process X_n with mean offspring number \mu and extinction probability a. Show that
M_n = X_n \mu^{-n} is a martingale with respect to X_0, \cdots, X_n.
M_n = a^{X_n} is a martingale with respect to X_0, \cdots, X_n.
Exercise 2.4 Show (by induction) that for Pólya's urn we have P(X_n=k) = \frac{1}{n+1} \text{ for } k=1, 2, \cdots, n+1 Show that M_n converges to M_\infty in distribution. Find M_\infty.
A stopping time T with respect to a sequence of random variables X_0, X_1, \cdots should be such that the random time at which you decide to stop depends only on the information you have accumulated so far. If you decide to stop at time n, then this decision should depend only on X_0, \cdots, X_n: you are not allowed to peek into the future.
Definition 3.1 A stopping time T with respect to the filtration \mathcal{F}_n is a random variable taking values \{0,1,2,\cdots, +\infty\} such that for any n the event \{T=n\} is measurable with respect to \mathcal{F}_n.
Example: T=k is a stopping time.
Example: The hitting time T_A =\inf\{j, X_j \in A \}, for some set A, is a stopping time.
Example: If T and S are stopping times then \min\{S,T\} is also a stopping time. In particular T_n = \min\{T,n\} is a bounded stopping time since T_n \le n, and we have T_0 \le T_1 \le \cdots \le T_n \le T.
The optional sampling theorem says, roughly speaking, that “you cannot beat a fair game”: if M_n is a martingale and T is a stopping time then E[M_T]=E[M_0]. This is not true in general, as we have seen when considering the martingale betting system, where 1=E[W_T]\not= E[W_0]=0. We start with the following result.
Theorem 3.1 (Optional Sampling Theorem (T bounded)) Suppose M_n is a martingale and T is a bounded stopping time (i.e. T \le K) then E[M_T|\mathcal{F}_0]=M_0 and in particular E[M_T]=M_0.
Proof. Since T\le K we can write M_{T}=\sum_{j=0}^K M_j 1_{\{T=j\}} We compute next E[M_T|\mathcal{F}_{K-1}]. Note that since T is bounded by K we have 1_{\{T=K\}}= 1_{\{T> K-1\}} which is measurable with respect to \mathcal{F}_{K-1}. Therefore we find
\begin{aligned} E[M_{T}|\mathcal{F}_{K-1}] & = E[ M_K 1_{\{T> K-1\}} | \mathcal{F}_{K-1}] + \sum_{j=0}^{K-1} E[ M_j 1_{\{T=j\}}|\mathcal{F}_{K-1}] \\ & = 1_{\{T> K-1\}} E[ M_K | \mathcal{F}_{K-1}] + \sum_{j=0}^{K-1} M_j 1_{\{T=j\}} = 1_{\{T> K-1\}} M_{K-1} + \sum_{j=0}^{K-1} M_j 1_{\{T=j\}} \\ & = 1_{\{T> K-2\}} M_{K-1} + \sum_{j=0}^{K-2} M_j 1_{\{T=j\}} \end{aligned} where we used that M_j 1_{\{T=j\}} is \mathcal{F}_{K-1} measurable if j\le K-1.
Using this we find \begin{aligned} E[M_{T}|\mathcal{F}_{K-2}]&=E[ E[M_{T}|\mathcal{F}_{K-1}]|\mathcal{F}_{K-2}]\\ & = E[ 1_{\{T> K-2\}} M_{K-1}|\mathcal{F}_{K-2}]+ \sum_{j=0}^{K-2} E[M_j 1_{\{T=j\}}|\mathcal{F}_{K-2}] \\ &= 1_{\{T> K-3\}} M_{K-2} + \sum_{j=0}^{K-3} M_j 1_{\{T=j\}} \end{aligned} and thus, inductively, E[M_T|\mathcal{F}_0]= M_0 \quad \blacksquare
To prove a more general version let us assume that P(T < \infty)=1. Then T_n=\min\{T,n\} converges to T and we can write M_{T} = M_{T_n} 1_{\{T\le n\}} + M_{T} 1_{\{T > n\}} = M_{T_n} + M_{T} 1_{\{T >n \}} - M_n 1_{\{T >n \}}
Since T_n is bounded, by the optional sampling theorem we have E[M_{T_n}]=E[M_0]. But we then need to control the remaining two terms.
If we assume that M_T is integrable, E[|M_T|]< \infty, the assumption P(T < \infty)=1 means that 1_{\{T>n\}} converges to 0 almost surely, and thus by the dominated convergence theorem we have \lim_{n \to \infty} E[M_{T} 1_{\{T >n \}}]=0.
The third term is more troublesome. Indeed, for the martingale betting system, if T > n it means we lost n bets in a row, which happens with probability \frac{1}{2^n}, for a total loss of -1 -2 - \cdots - 2^{n-1}=-(2^n-1). Therefore E[M_n 1_{\{T> n\}}] = \frac{1}{2^n}(1-2^n) \to -1 \text{ as } n \to \infty It does not converge to 0 but to -1, in accordance with the result E[M_T]=1.
These considerations lead to the following
Theorem 3.2 (Optional Sampling Theorem (general version)) Suppose M_n is a martingale and T is a finite stopping time (i.e. P(T < \infty)=1). If E[|M_T|]<\infty and \lim_{n \to \infty}E[ |M_n| 1_{\{T>n\}}] =0 then E[M_T]=E[M_0].
Remark If the sequence M_n is uniformly bounded, i.e. |M_n| \le C then the optional sampling theorem holds since
E[ |M_n| 1_{\{T>n\}}] \le C P(T >n) \to 0\,.
Remark Another condition which guarantees the optional sampling theorem is C=\sup_{n}E[M_n^2] < \infty. Indeed if this holds, given \epsilon>0 we have \begin{aligned} E[|M_n| 1_{\{T>n\}}] & = E[|M_n| 1_{\{|M_n| > C/\epsilon\}} 1_{\{T>n\}}] + E[|M_n| 1_{\{|M_n| \le C/\epsilon\}} 1_{\{T>n\}}] \\ & \le \frac{\epsilon}{C} E[|M_n|^2 1_{\{T>n\}} ] + \frac{C}{\epsilon}P(T >n) \le \epsilon + \frac{C}{\epsilon}P(T >n) \end{aligned} Taking n \to \infty shows that \lim_{n} E[|M_n| 1_{\{T>n\}}] \le \epsilon.
Example: Gambler’s ruin probability. Define S_0=a and S_n = a+ X_1 + \cdots + X_n where the X_i are IID fair bets P(X_i=-1)=P(X_i=1)=\frac{1}{2}.
Then S_n is a martingale. Consider the stopping time
T = \min \{n : S_n =0 \text{ or } S_n=N \}
which describes the time at which a gambler starting with a fortune a either goes bankrupt or reaches a fortune of N. Note that if T>n then 0 < S_n < N and thus E[|S_n| 1_{\{T> n\}}] \le N P(T >n) \to 0 as n \to \infty. Therefore, by Theorem 3.2,
E[S_T] = E[S_0]=a.
We then find
a= E[S_T]= N P(S_T=N) \quad \implies P(S_T=N)= \frac{a}{N}
This gives another (computation free) derivation of the gambler’s ruin formula! The case p\not= \frac{1}{2} can also be treated using a (different) martingale and will be considered in the homework.
Example: Gambler’s ruin playing time. Suppose S_n is as in the previous example. Then M_n=S_n^2 -n is also a martingale since
E[S_{n+1}^2- (n+1) | \mathcal{F}_n]= E[S_n^2 + 2X_{n+1}S_n + X_{n+1}^2- (n+1) | \mathcal{F}_n] = S_n^2 + 1 - (n+1) = S_n^2 -n
To apply the optional sampling theorem we note that P(T>n) \le C \rho^n for some constants C>0 and \rho<1, since T is a hitting time for a finite state Markov chain.
Therefore we have E[|M_n| 1_{\{T> n\}}] \le (N^2 + n) C \rho^n \to 0 and we can apply the optional sampling theorem:
E[M_T] =E[M_0]=a^2
But, using the previous example we find
a^2= E[M_T] = E[S_T^2] - E[T] = N^2 P(S_T=N) - E[T] = aN - E[T]
and thus
E[T]= a(N-a)
which gives the expected length of play.
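A short simulation (our sketch; the values a=3 and N=10 are arbitrary) confirms both formulas, P(S_T=N)=a/N and E[T]=a(N-a):

```python
import random

def ruin(a, N):
    """One fair gambler's ruin run from fortune a; returns (reached N, T)."""
    s, t = a, 0
    while 0 < s < N:
        s += random.choice((-1, 1))
        t += 1
    return s == N, t

a, N, trials = 3, 10, 50_000
wins = times = 0
for _ in range(trials):
    hit, t = ruin(a, N)
    wins += hit
    times += t

print("P(S_T = N) ~", wins / trials, "  theory:", a / N)
print("E[T]       ~", times / trials, "  theory:", a * (N - a))
```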
We can use martingales, as for the gambler’s ruin, to analyze much more complicated models. The voter model is a simple opinion dynamics model. Here are the ingredients.
A graph G=(V,E) is given. At each vertex of the graph is an agent who can be either in state 0 or in state 1. We view this as two different possible opinions for the agent. We describe the state of the system by a vector {\bf \sigma} = \left( \sigma(v)\right)_{v \in V} \quad \text{ with } \sigma(v) \in \{0,1\} and the state space is S=\{0,1\}^{|V|}.
We think of G as a weighted directed graph. To every directed edge we associate a weight c(v,w)>0 and define c(v)= \sum_{w} c(v,w). We do not need to assume that c(v, w)= c(w,v), but we assume that the graph is connected: there is a path along directed edges between any pair of vertices. To these weights we can associate the transition probabilities
p(v,w)=\frac{c(v,w)}{c(v)}
Let Y_n be the Markov chain on the state space V with transition probabilities p(v,w). It is irreducible and has a stationary distribution \pi(v). If c(v,w)=c(w,v) then we know that the stationary distribution is \pi(v) = \frac{c(v)}{c_G} where c_G =\sum_{v} c(v). In general \pi might be difficult to compute explicitly.
In the voter model, at each unit of time you pick a voter uniformly at random, say the voter at vertex v. The voter picks one of its neighbors w with probability p(v,w) and, if the neighbor at vertex w has a different opinion, the voter at v adopts the opinion of w. This is admittedly a pretty simplistic model, but let us analyze it nonetheless.
Let us denote by X_n the corresponding Markov chain on S. The transition probabilities are given by \begin{aligned} P( {\bf \sigma}, {\bf \sigma} + {\bf e}_v ) &= \frac{1}{|V|} 1_{\{\sigma(v)=0\}} \sum_{w : \sigma(w)=1} p(v,w) \\ P( {\bf \sigma}, {\bf \sigma} - {\bf e}_v ) & = \frac{1}{|V|} 1_{\{\sigma(v)=1\}} \sum_{w : \sigma(w)=0} p(v,w) \end{aligned} where {\bf e}_v is the state with {\bf e}_v(v)=1 and {\bf e}_v(w)=0 if w\not= v.
The key insight is the following
Theorem 3.3 For the voter model M_n = \pi X_n = \sum_{v} \pi(v) X_n(v) is a martingale.
Proof. By the Markov property E[M_{n+1}-M_{n}| X_0, \cdots, X_n ] = E[M_{n+1}-M_{n}| X_n ], so it is enough to show that E[M_{n+1}-M_{n}| X_n =\sigma] =0.
To see this we note that X_{n+1}=X_n \pm {\bf e}_v \implies M_{n+1} - M_n = \pm \pi(v) Therefore E[M_{n+1}-M_{n}| X_n =\sigma] = \frac{1}{|V|}\sum_v \sum_{w} \pi(v) p(v,w) \left(1_{\{\sigma(v)=0\}} 1_{\{\sigma(w)=1\}} - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} \right) Now we rewrite \begin{aligned} 1_{\{\sigma(v)=0\}} 1_{\{\sigma(w)=1\}} - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} & = 1_{\{\sigma(v)=0\}} (1- 1_{\{\sigma(w)=0\}}) - 1_{\{\sigma(v)=1\}}1_{\{\sigma(w)=0\}} \\ &= 1_{\{\sigma(v)=0\}} - 1_{\{\sigma(w)=0\}} \end{aligned} and thus E[M_{n+1}-M_{n}| X_n =\sigma] = \frac{1}{|V|} \sum_{v: \sigma(v)=0} \sum_{w} \pi(v) p(v, w) - \sum_{w:\sigma(w)=0} \sum_{v} \pi(v)p(v,w)
Using the fact that \pi is the stationary distribution we have the balance equation
\sum_{v} \pi(v) p(v,w) = \sum_{v} \pi(w) p(w,v)
Using this and then exchanging the indices v and w we find
\sum_{w:\sigma(w)=0} \sum_{v} \pi(v)p(v,w) = \sum_{w:\sigma(w)=0} \sum_{v} \pi(w)p(w,v) = \sum_{v:\sigma(v)=0} \sum_{w} \pi(v)p(v,w)
which implies that
E[M_{n+1}-M_{n}| X_n =\sigma] =0
and therefore M_n is a martingale. \quad \blacksquare
We can now apply the optional sampling theorem, taking the stopping time to be the consensus time at which everyone is in agreement: T= \inf\{ n : X_n(v)=0 \text{ for all } v \text{ or } X_n(v)=1 \text{ for all } v \} Note that these are absorbing states, and since the Markov chain has a finite state space, P(T < \infty)=1; moreover M_n is bounded (in fact 0 \le M_n\le 1), so Theorem 3.2 applies. We have, similarly to the gambler’s ruin, P(X_T =(1,\cdots, 1))=E[M_T]= E[M_0]= \pi X_0 which gives the absorption probabilities.
Example If the underlying motion is the simple random walk on the graph, c(v,w)=c(w,v)=1, then \pi(v) = \frac{\deg(v)}{2|E|} and we can easily compute
\pi X_0. For example, if we consider the complete graph on N vertices, or any regular graph in which all vertices have the same degree, then P(X_T =(1,\cdots, 1)) is simply the proportion of vertices with opinion 1.
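As an illustration, here is a sketch of the voter model (our own, for the unweighted case c(v,w)=c(w,v)=1 on a small hand-picked graph), which estimates the consensus probability and compares it with \pi X_0 = \sum_v \deg(v) X_0(v)/(2|E|):

```python
import random

# Undirected graph as adjacency lists; degrees are 2, 3, 3, 2, so 2|E| = 10.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
x0 = {0: 1, 1: 0, 2: 1, 3: 0}                 # initial opinions
two_E = sum(len(nb) for nb in graph.values())
m0 = sum(len(graph[v]) * x0[v] for v in graph) / two_E   # pi X_0

def run_to_consensus(opinion):
    opinion = dict(opinion)
    while len(set(opinion.values())) > 1:
        v = random.choice(list(opinion))      # pick a voter uniformly
        w = random.choice(graph[v])           # pick a neighbor, p(v,w)=1/deg(v)
        opinion[v] = opinion[w]               # adopt the neighbor's opinion
    return next(iter(opinion.values()))

trials = 20_000
ones = sum(run_to_consensus(x0) for _ in range(trials))
print("P(consensus = all ones) ~", ones / trials, "  theory:", m0)
```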
Exercise 3.1 Here is another version of the optional sampling theorem. Suppose that M_n is a martingale. The increments B_n of the martingale are given by B_n= M_{n}-M_{n-1}.
Show that if the stopping time T has finite expectation E[T]<\infty and if the martingale M_n has uniformly bounded increments, i.e.
\sup_{n} E[|B_n|] \le C
then E[M_T]=E[M_0].
Hint: Bound E[|M_{T_n}-M_0|] and then use the dominated convergence theorem to take n\to \infty.
Exercise 3.2 (Wald’s identity) As we have seen before, if N is an integer-valued RV independent of the IID copies (X_i) of a random variable X, then E[\sum_{k=1}^N X_k]= E[X]E[N].
Prove the following generalization: if T is a stopping time with E[T] finite then E[\sum_{k=1}^T X_k]= E[X]E[T]. Hint: Consider a suitable martingale and use Exercise 3.1.
Exercise 3.3 (Gambler’s ruin probabilities and expected playing time) Suppose P(X_i=1)=p, P(X_i=-1)=q=1-p and let S_n=a + X_1 + \cdots +X_n. And let T = \min \{n : S_n =0 \text{ or } S_n=N \}
1. Show that M_n= \left(\frac{q}{p}\right)^{S_n} is a martingale with respect to X_1, X_2, \cdots.
2. Use part 1. and the optional sampling theorem to compute P(S_T=0).
3. Use the martingale Z_n = S_n + (1-2p)n and part 2. to compute E[T].
The martingale convergence theorem asserts that, under quite general circumstances, a martingale M_n converges to a limiting random variable M_\infty.
Theorem 4.1 (Martingale convergence theorem (version 1)) If M_n is a martingale such that E[|M_n|]\le c for all n then there exists a random variable M_\infty such that M_n converge to M_\infty almost surely.
Proof. Pick two (arbitrary) numbers a < b. The idea of the proof is to show that the probability that the martingale fluctuates infinitely often between a and b is zero. Since this is true for every pair a<b (it is enough to consider rational pairs), and since the bound E[|M_n|]\le c prevents M_n from diverging to \pm \infty, this shows almost sure convergence.
Consider the following betting strategy. Think of M_n as the cumulative gain from a sequence of fair games, so that M_{n+1}-M_n is the gain from the (n+1)st game. We make the following sequence of bets:
If M_n < a, bet B_{n+1}=1 until the martingale reaches or exceeds the value b.
Once M_n reaches b, stop betting (that is, B_{n+1}=0) until M_n comes back below a.
Continue this process. If the martingale fluctuates infinitely often between a and b then the betting strategy provides a long-term gain, and we show that this cannot happen because of the martingale property.
The gain from this strategy after n games is given by W_{n}=\sum_{j=1}^n B_j(M_j - M_{j-1}) where B_j is either 0 or 1 depending on the position of the martingale. Since M_n is a martingale, W_n is a martingale with respect to M_0, M_1, M_2, \cdots. Indeed, since E[M_{n+1}-M_n|\mathcal{F}_n]=0, E[W_{n+1}|\mathcal{F}_n]= E[B_{n+1}(M_{n+1} - M_n)|\mathcal{F}_n] + E[W_n|\mathcal{F}_n] = B_{n+1}E[M_{n+1} - M_n|\mathcal{F}_n] + W_n = W_n.
Let U_n be the number of upcrossings up to step n, that is, the number of times the martingale goes from below a to above b. Then from the structure of the betting W_n \ge (b-a) U_n- (M_n-a)_{-} since (M_n-a)_- overestimates the loss during the last interval of play (if the bet is on). Since E[W_n]=E[W_0]=0 we have 0 =E[W_n] \ge (b-a) E[U_n] - E[(M_n-a)_{-}] \implies E[U_n] \le \frac{E[(M_n-a)_{-}]}{b-a} \le \frac{c+|a|}{b-a} Since the right-hand side is independent of n, the total number of upcrossings U_\infty has finite expectation and thus U_\infty is finite almost surely. \quad \blacksquare
From the convergence theorem we have M_n \to M_\infty almost surely. In general this does not imply that \lim_{n}E[M_n]=E[M_\infty] without further assumptions on M_n. For example, for the martingale betting system we have E[W_n]=0 for all n. However, if we stop betting at time T (first win) then our gain after that stays at 1, and thus \lim_n W_n = 1 almost surely, so W_\infty=1, and clearly E[W_n] does not converge to E[W_\infty]!
In order to obtain convergence of the expectations we need a stronger condition on M_n than our assumption \sup_{n}E[|M_n|] \le C <\infty. The proper mathematical assumption is that the sequence M_n should be uniformly integrable (see Math 605 for more details). The sequence \{X_n\} is uniformly integrable if \sup_{n} E[|X_n|1_{\{|X_n|\ge R\}}] \to 0 \text{ as } R \to \infty This means that the tail behavior of the X_n is controlled uniformly. We have the following
Theorem 4.2 If X_n converges almost surely to X and \{X_n\} is uniformly integrable then \lim_{n} E[X_n] = E[X].
For example, if \sup_{n}E[|X_n|^2] \le c < \infty then the \{X_n\} are uniformly integrable. The argument is the same as the one used after Theorem 3.2.
Theorem 4.3 (Martingale convergence theorem (version 2)) If M_n is a martingale and \{M_n\} are uniformly integrable then there is a random variable M_\infty such that M_n \to M_\infty almost surely and \lim_{n \to \infty} E[M_n]= E[M_\infty]
Random harmonic series It is well-known that the harmonic series 1 + \frac{1}{2} + \frac{1}{3} + \cdots diverges while the alternating harmonic series 1 - \frac12 + \frac13 - \cdots converges.
What happens if we choose the signs in the series randomly? Let X_i be IID random variables with P(X_i=-1)=P(X_i=1)=\frac{1}{2}. Let M_0=0 and for n > 0 define M_n = \sum_{j=1}^n \frac{1}{j}X_j\,. Since M_n is a sum of independent random variables with mean 0, it is a martingale and we have E[M_n]=0 for all n.
By the martingale convergence theorem, M_n converges almost surely to some random variable M_\infty. Therefore the random harmonic series \sum_{j=1}^\infty \frac{1}{j}X_j converges almost surely.
Note that by independence E[M_n^2] = \sum_{j=1}^n \frac{1}{j^2} \le \sum_{j=1}^\infty \frac{1}{j^2} < \infty (this also gives the bound \sup_n E[|M_n|]<\infty needed in Theorem 4.1), and thus M_n is uniformly integrable and E[\sum_{j=1}^\infty \frac{1}{j}X_j]=0.
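A quick numerical experiment (our own sketch) displays sample paths of M_n settling down to a random limit, in line with the convergence theorem:

```python
import random

# Partial sums of the random harmonic series M_n = sum_{j<=n} X_j / j.
for path in range(3):
    m = 0.0
    for j in range(1, 100_001):
        m += random.choice((-1, 1)) / j
        if j in (100, 10_000, 100_000):
            print(f"path {path}: M_{j} = {m:+.4f}")
```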
Suppose X_n is a branching process with offspring distribution given by a random variable Z with mean \mu=E[Z].
First we prove that M_n= \mu^{-n} X_n is a martingale. We have, using the Markov property, that E[X_{n+1}|\mathcal{F}_n]= E[X_{n+1}|X_n] and E[X_{n+1}|X_n=k] = E[Z^{(n)}_1 +\cdots + Z^{(n)}_k] = \mu k and thus E[X_{n+1}|X_n]=\mu X_n. This proves that \mu^{-n} X_n is a martingale.
If \mu\le 1 (excluding the trivial case Z \equiv 1) we have proved that the probability of extinction is 1, and thus X_n=0 for all n sufficiently large; hence M_n converges to 0 almost surely.
If \mu>1 then M_n=X_n/\mu^n and, by the martingale convergence theorem, M_n \to M_\infty. We show that M_\infty is a non-trivial random variable by showing that E[M_\infty]=\lim_n E[M_n]=1 (if we start with a single individual).
To do this we need to show uniform integrability. Suppose \sigma^2=V(Z). We have the formula E[(M_{n+1}-M_n)^2|\mathcal{F}_{n}] = E[ M_{n+1}^2|\mathcal{F}_n] -2 E[ M_{n+1} M_n |\mathcal{F}_n] + E[M_n^2|\mathcal{F}_n] = E[ M_{n+1}^2|\mathcal{F}_n] - M_n^2 and so E[ M_{n+1}^2|\mathcal{F}_n] =M_n^2 + E[(M_{n+1}-M_n)^2|\mathcal{F}_{n}] To compute the second term note that E[(M_{n+1}-M_n)^2|\mathcal{F}_{n}]=E[ (\mu^{-(n+1)}X_{n+1}- \mu^{-n} X_n)^2|\mathcal{F}_{n}]= \mu^{-2(n+1)}E[ (X_{n+1}-\mu X_n)^2|\mathcal{F}_{n} ] Now E[ (X_{n+1}-\mu X_n)^2|X_n=k] = E[ ( Z^{(n)}_1 +\cdots + Z^{(n)}_k -\mu k)^2 ]= k \sigma^2 and so E[ (X_{n+1}-\mu X_n)^2|X_n]= X_n \sigma^2.
Combining all this gives (using again that E[\mu^{-n}X_n]=E[M_n]=1, so that E[X_n]=\mu^n) E[M_{n+1}^2] = E[M_n^2] + \mu^{-2(n+1)}\sigma^2 E[X_n] = E[M_n^2] + \frac{\sigma^2}{\mu^{n+2}} Using that E[M_0^2]=1 we find E[M_1^2]=1+ \frac{\sigma^2}{\mu^2} and by induction E[M_n^2] = 1 + \sigma^2 \sum_{k=2}^{n+1} \frac{1}{\mu^k} This proves that \sup_{n}E[M_n^2] < \infty and thus M_n is uniformly integrable.
Thus M_\infty is non-trivial.
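Here is a minimal simulation (our own; we take the offspring distribution Z ~ Poisson(μ) with μ=1.5 purely for illustration) of the martingale M_n = X_n/μ^n. The sample mean of M_n stays near 1, and the paths split between extinction (M_∞=0) and a positive random limit:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n_gen, trials = 1.5, 20, 2_000

limits = []
for _ in range(trials):
    x = 1
    for _ in range(n_gen):
        if x == 0:
            break
        x = rng.poisson(mu, size=x).sum()  # total offspring of x individuals
    limits.append(x / mu**n_gen)           # M_n, a proxy for M_infinity

limits = np.array(limits)
print("E[M_n]     ~", limits.mean(), "(theory: 1)")
print("P(M_n = 0) ~", (limits == 0).mean(), "(extinction probability)")
```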
Estimation problem Suppose X_1, X_2, \cdots are IID random variables whose mean E[X_i]=\theta is unknown. In a Bayesian spirit we equip \theta with a probability distribution (called the prior distribution) and call the corresponding random variable \Theta.
Example: The simplest model is when the X_i are Bernoulli random variables with an unknown probability of success \theta. A natural prior distribution for \Theta is the uniform distribution on [0,1]. In this case we have the joint distribution f(x_1, \cdots, x_n ,\theta) = f(x_1, \cdots, x_n |\theta)f(\theta)= \theta^{x_1+\cdots+ x_n} (1-\theta)^{n-(x_1+\cdots+ x_n)}
There is a natural martingale associated with this setup, namely M_0 = E[\Theta], \quad \quad M_n =E[\Theta|X_1, \cdots, X_n] This means that M_0 is the expectation of \theta under the prior distribution f(\theta) and M_n is the expectation of \theta under the posterior distribution f(\theta|x_1, \cdots, x_n).
By the martingale convergence theorem (under suitable assumptions on \Theta to ensure uniform integrability of M_n, for example if \Theta is bounded) we have \lim_{n \to \infty} M_n = M_\infty where M_\infty is a random variable which depends on the infinite sequence X_1, X_2, \cdots.
Since for m> n we have M_n = E[M_m|X_1, \cdots, X_n], taking m \to \infty shows that M_n=E[M_\infty|X_1, \cdots, X_n]. Since \theta is the mean, by the LLN, conditionally on \Theta=\theta we have \theta= \lim_{n\to \infty} \frac{X_1 + \cdots + X_n}{n} and thus \Theta is a function of X_1, X_2, \cdots. This shows that M_\infty=\Theta and thus \lim_{n \to \infty} E[\Theta|X_1, \cdots, X_n] = \Theta
In our simple example we can compute the posterior distribution by Bayes rule (after dusting off our knowledge of Beta random variables). With k= x_1+\cdots+x_n, f(\theta|x_1,\cdots, x_n ) = \frac{\theta^{k} (1-\theta)^{n-k}}{\int_0^1 \theta^{k} (1-\theta)^{n-k} d\theta} = \frac{(n+1)!}{k! (n-k)!} \theta^{k} (1-\theta)^{n-k} which is a Beta random variable with parameters \alpha=k+1 and \beta=n+1-k. Since the mean of a Beta random variable with parameters \alpha and \beta is \frac{\alpha}{\alpha + \beta}, we find E[\Theta| X_1, \cdots, X_n] = \frac{1+ X_1+ \cdots + X_n}{n+2}
This is related to Pólya's urn. To see this let us compute P(X_1+ \cdots + X_{n+1}=k+1|X_1+ \cdots+ X_n=k) under this model. By conditioning we find \begin{aligned} & P(X_1+ \cdots + X_{n+1}=k+1|X_1+ \cdots+ X_n=k) \\ &= \int_0^1 P( X_{n+1}=1|\Theta=\theta) f(\theta|X_1+\cdots+ X_n=k) d\theta = E[\Theta|X_1+\cdots+ X_n=k]=\frac{k+1}{n+2} \end{aligned}
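A small sketch (our own) draws Θ from the uniform prior and tracks the posterior mean M_n = (1+S_n)/(n+2), which indeed approaches the realized value of Θ:

```python
import random

random.seed(1)
for path in range(3):
    theta = random.random()            # Theta ~ Uniform[0,1], the prior
    s, n = 0, 10_000
    for _ in range(n):
        s += random.random() < theta   # Bernoulli(theta) observation
    m = (1 + s) / (n + 2)              # posterior mean M_n
    print(f"theta = {theta:.4f}, M_n = {m:.4f}")
```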
Let us consider the general Polya’s urn starting with r red balls and g green balls. At each time a ball is drawn at random, replaced in the urn together with c extra balls of the same color.
We let X_n be the number of green balls at time n and M_n=\frac{X_n}{r+g+nc} the fraction of green balls at time n. We have P(X_{n+1}=j+c|X_n=j) = \frac{j}{r+g+nc} \quad \quad P(X_{n+1}=j|X_n=j)=\frac{r+g+nc-j}{r+g+nc} Thus \begin{aligned} E[M_{n+1}|X_n=j]&= \frac{j+c}{r+g+(n+1)c}\frac{j}{r+g+nc} + \frac{j}{r+g+(n+1)c} \frac{r+g+nc-j}{r+g+nc} \\ & = \frac{j(r+g +(n+1)c)}{(r+g+(n+1)c)(r+g+nc)} = \frac{j}{r+g+nc} \end{aligned}
This implies that M_n is a martingale, and by the martingale convergence theorem we have M_n \to M_\infty. Let us compute the distribution of M_\infty. To do this, imagine that in n draws the first m ones are green and the last l=n-m are red. This event has probability \frac{g}{g+r}\frac{g+c}{g+r+ c}\cdots \frac{g+ (m-1)c}{g+r +(m-1)c}\, \frac{r}{g+r+mc}\cdots \frac{r+ (n-m-1)c}{g+r +(n-1)c} Observe that drawing m green balls and l=n-m red balls in n draws in any order has exactly the same probability: the denominator remains the same and the terms in the numerator are merely permuted.
This is all we need. As a warm-up let us consider the special case r=g=c=1. We have P(X_n= m+1) = {n \choose m} \frac{m!(n-m)!}{(n+1)!} = \frac{1}{n+1} \quad m=0,1,2,\cdots, n This shows that the distribution of M_n converges to the uniform distribution on [0,1].
Next let us consider c=1 and g, r arbitrary: \begin{aligned} P(X_n= g+m) & = {n \choose m} \frac{g(g+1)\cdots (g+m-1)\, r (r+1) \cdots (r+n-m-1)} {(g+r)(g+r+1) \cdots (g+r+(n-1))} \\ &= \frac{(g+r-1)!}{(g-1)!(r-1)!} \frac{n!}{m!(n-m)!} \frac{(g+m-1)! (r+n-m-1)!} {(g+r+n-1)!} \\ &= \frac{(g+r-1)!}{(g-1)!(r-1)!} \frac{m^{g-1} (n-m)^{r-1} }{n^{g+r-1}} \frac{\frac{(g+m-1)!}{m!\,m^{g-1}}\, \frac{(r+n-m-1)!}{(n-m)!\,(n-m)^{r-1}}}{\frac{(g+r+n-1)!}{n!\,n^{g+r-1} }} \end{aligned}
We now take the limit n \to \infty and m \to \infty in such a way that \frac{m}{n}\to x and \frac{n-m}{n} \to 1-x. Note that for any fixed integer a \ge 1, \frac{(n+a-1)!}{n!\, n^{a-1} } = \frac{n+1}{n}\, \frac{n+2}{n} \cdots \frac{n+a-1}{n} \to 1 \quad \text{ as } n \to \infty Therefore, as n \to \infty, P\left(M_n= \frac{g+m}{g+r+n}\right) \approx \frac{(g+r-1)!}{(g-1)!(r-1)!} \left( \frac{m}{n}\right)^{g-1} \left(\frac{n-m}{n}\right)^{r-1} \frac{1}{n} By a Riemann sum argument this shows that the distribution of M_n converges to a Beta random variable with parameters g and r.
A similar argument shows that if c>1 then M_n converges to beta random variable with parameters g/c and r/c.
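A simulation of the urn (our sketch; the values g=2, r=3, c=1 are arbitrary) supports the Beta(g,r) limit; we compare the first two moments of M_n with those of a Beta(2,3) random variable:

```python
import random

g0, r0, draws, trials = 2, 3, 2_000, 20_000
fracs = []
for _ in range(trials):
    g, r = g0, r0
    for _ in range(draws):
        if random.random() < g / (g + r):
            g += 1                    # drew green: add a green ball
        else:
            r += 1                    # drew red: add a red ball
    fracs.append(g / (g + r))         # M_n, the fraction of green balls

mean = sum(fracs) / trials
var = sum((f - mean) ** 2 for f in fracs) / trials
a, b = g0, r0
print("mean:", mean, "  Beta(g,r) mean:", a / (a + b))
print("var :", var, "  Beta(g,r) var :", a * b / ((a + b) ** 2 * (a + b + 1)))
```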
If \phi: \mathbb{R} \to \mathbb{R} is a convex function then the graph of \phi lies above any tangent line to the graph, and so for any point x_0 we can find a number a such that \phi(x) \ge \phi(x_0)+ a (x-x_0) If \phi is differentiable at x_0 then we can take a=\phi'(x_0), while in general we may have more than one choice for a. This simple geometric observation lies behind Jensen's inequality, a simple yet very important inequality.
Theorem 5.1 (Conditional Jensen inequality) Suppose X is a random variable with E[|X|] <\infty and \phi: \mathbb{R} \to \mathbb{R} a convex function. Then we have \phi(E[ X|Y]) \le E[ \phi(X)|Y] \quad \text{conditional Jensen inequality} In particular we have \phi(E[X]) \le E[ \phi(X)] (Jensen inequality).
Proof. Choosing x_0= E[X|Y] we have \phi(X) \ge \phi(E[X|Y]) + a (X-E[X|Y]) Taking conditional expectation with respect to Y gives the result: E[\phi(X)|Y] \ge E[\phi(E[X|Y])|Y] + a E[(X-E[X|Y])|Y]= \phi(E[X|Y]) \,. \quad \blacksquare
Recall that if X is a normal random variable then its moment generating function is given by E[e^{tX}]= e^{\mu t + \frac{\sigma^2 t^2}{2}} which implies, by the Chernoff bound, the concentration bound P( X - E[X] \ge \epsilon) \le \inf_{t \ge 0} \frac{E\left[e^{t(X- E[X])}\right]}{e^{t\epsilon}} = e^{- \sup_{t \ge 0} \left\{ t\epsilon - \frac{\sigma^2 t^2}{2} \right\} } = e^{-\frac{\epsilon^2}{2\sigma^2}} Note that we also have a similar bound for the left tail, P(X-E[X]\le - \epsilon) \le e^{-\frac{\epsilon^2}{2\sigma^2}}, proved in the same way.
It turns out that bounded random variables also satisfy such a Gaussian concentration bound. To prove this we need the following result, which expresses the fact that among all random variables supported on the interval [-A,B] with mean 0, the most spread out is the one concentrated on the endpoints -A and B.
Theorem 5.2 (Hoeffding's bound in conditional form) Suppose X is a random variable such that E[X|Y]=0 and -A \le X\le B. Then for any convex function \phi we have E[\phi(X)|Y]\le \phi(-A) \frac{B}{A+B} + \phi(B) \frac{A}{A+B} In particular, if A=B, we have E[ e^{tX}|Y] \le \cosh(At) \le e^{\frac{A^2 t^2}{2}}
Proof. If x \in [-A, B] let us write x as a convex combination of -A and B, that is x= \frac{B-x}{A+B}(-A) + \frac{A+x}{A+B} B \,. The convexity of \phi implies that \phi(X) \le \frac{B-X}{A+B}\phi(-A) + \frac{A+X}{A+B} \phi(B) \,. Taking conditional expectation with respect to Y and using that E[X|Y]=0 gives the result.
Taking now \phi(x)=e^{tx} and A=B we find E[ e^{tX}] \le \frac{1}{2}(e^{-At} + e^{At}) = \cosh(At) = \sum_{n=0}^\infty \frac{A^{2n} t^{2n}}{(2n)!} \le \sum_{n=0}^\infty \frac{A^{2n} t^{2n}}{2^n n!} =e^{\frac{t^2 A^2}{2}}
The next theorem provides a concentration inequality for martingales with bounded increments.
Theorem 5.3 (Azuma-Hoeffding’s theorem) Suppose M_n is a martingale with M_0=0 and with bounded increments, i.e. B_n=M_n-M_{n-1} satisfies the bound |B_n| \le \sigma_n \,. Then we have the Gaussian concentration bound P(M_n \ge \epsilon ) \le e^{-\frac{\epsilon^2}{2 \sum_{k=1}^n \sigma_k^2}}
Proof. Using the martingale property and Theorem 5.2, \begin{aligned} E[e^{tM_n}]= E[E[e^{tM_n}|\mathcal{F}_{n-1}]] = E[E[e^{tM_{n-1}} e^{tB_n}|\mathcal{F}_{n-1}]] = E[e^{tM_{n-1}} E[e^{tB_n}|\mathcal{F}_{n-1}] ] \le e^{\frac{t^2 \sigma_n^2}{2}} E[e^{tM_{n-1}}] \end{aligned} Iterating we find E[e^{tM_n}] \le e^{\frac{t^2 \sum_{k=1}^n\sigma_k^2}{2}} and the Chernoff bound gives the result.
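The bound is easy to test numerically. In this sketch (our own) the martingale is a simple ±1 random walk, so σ_k=1 and the bound reads P(M_n ≥ ε) ≤ e^{-ε²/(2n)}:

```python
import math
import random

n, trials, eps = 100, 200_000, 25
count = 0
for _ in range(trials):
    m = sum(random.choice((-1, 1)) for _ in range(n))   # M_n with |B_k| <= 1
    count += m >= eps

print("P(M_n >= eps) ~", count / trials)
print("Azuma bound   :", math.exp(-eps**2 / (2 * n)))
```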
McDiarmid's theorem is an application of Azuma-Hoeffding's theorem and provides concentration bounds for (nonlinear) functions of independent random variables X_1, \cdots, X_n, under certain conditions.
Definition 5.1 We say that h(x_1, x_2, \cdots, x_n) satisfies the bounded difference property if there exist constants c_k such that for all x_k, x_k' |h(x_1,\cdots, x_k, \cdots, x_n) - h(x_1,\cdots, x_k', \cdots, x_n)| \le c_k that is, we control the change of h when changing only one coordinate at a time.
Theorem 5.4 (McDiarmid Theorem) Suppose X_1, \cdots X_n are independent RVs and h(x_1, \cdots, x_n) satisfies the bounded difference property (almost surely). Then we have \begin{aligned} P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \ge \varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \\ \end{aligned}
Proof. “Martingale trick”: We construct a suitable martingale. Let us define random variables Y_k by Y_0=E[h(X_1, \cdots, X_n)] and, for 1\le k \le n, Y_k =E[h(X_1,\cdots, X_n)|X_1, \cdots, X_k] Note that Y_n=h(X_1,\cdots, X_n) and by construction E[Y_k| X_1, \cdots, X_{k-1}]= Y_{k-1}, that is, Y_k is a martingale and so is M_k=Y_k-Y_0. To use Theorem 5.3 we need to prove that the increment Y_k-Y_{k-1} = E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k]- E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}] is bounded.
Duplication trick: Let \widehat{X}_k be an independent copy of the random variable X_k. Then by linearity of the expectation and the bounded difference property we have, almost surely,
\left| E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k] - E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]\right| \le c_k
\tag{5.1} Furthermore we have
\begin{aligned}
E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]\\
&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]
\end{aligned}
\tag{5.2} The first equality holds because X_k and \widehat{X}_{k} are identically distributed and the left-hand side is a function of X_1, \cdots, X_{k-1} only. The second equality holds because \widehat{X}_{k} is independent of X_1, \cdots, X_k. Combining Equation 5.1 and Equation 5.2 shows that |Y_k-Y_{k-1}| \le c_k almost surely. \quad \blacksquare
Example: Suppose S_n = X_1 + \cdots + X_n is a sum of IID random variables such that E[X_i]=\mu and a \le X_i \le b. Then the function h(x_1, \cdots, x_n)=\frac{x_1+\cdots+x_n}{n} satisfies the bounded difference property with c_k=\frac{b-a}{n} and we obtain a Hoeffding-type bound for bounded random variables \begin{aligned} P \left( \frac{X_1 + \cdots + X_n}{n} - \mu \ge \varepsilon \right) &\le e^{- \frac{n \varepsilon^2}{2(b-a)^2}} \end{aligned} (the classical Hoeffding theorem has the sharper exponent -\frac{2n\varepsilon^2}{(b-a)^2}).
Example: Suppose we are interested in estimating the variance. We use the unbiased variance estimator V_N(X_1, \cdots, X_N)=\frac{1}{N-1} \sum_{i=1}^N\left(X_i - \frac{S_N}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^N X_i^2 - \frac{S_N^2}{(N-1)N} If we change X_1 into \widehat{X}_1 then S_N(X_1, \cdots, X_N ) - S_N(\widehat{X}_1, \cdots, X_N ) = X_1 - \widehat{X}_1 and so V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N) = \frac{X_1^2 - \widehat{X}_1^2}{N-1} - \frac{X_1 - \widehat{X}_1}{N-1} \left(\frac{S_N(X_1, \cdots, X_N )}{N} + \frac{S_N(\widehat{X}_1, \cdots, X_N )}{N} \right)
Let us assume a \le X_i \le b. Since we can replace X_i by X_i -\frac{a+b}{2} without changing V_N, we can instead assume that -\frac{(b-a)}{2} \le X_i \le \frac{(b-a)}{2}. Writing c=b-a, we then find the bounded difference bound |V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N)| \le \frac{\frac{5}{4}c^2}{N-1} = c_k and thus by McDiarmid's theorem we get P( V_N - \sigma^2 \ge \varepsilon ) \le e^{- \frac{\varepsilon^2}{2 \sum_{k=1}^N c_k^2}} = e^{ - \frac{(N-1)^2}{N} \frac{8 \varepsilon^2}{25 c^4}} \,\le\, e^{ -(N-2)\frac{8\varepsilon^2}{25c^4}} This decays exponentially in N and can be used to build a confidence interval for the variance, very similar to the one for the mean.
Another classical example in probability is the so-called “balls and bins” problem: suppose we have m balls that are thrown into n bins, the location of each ball being chosen at random independently of the other balls. It turns out this problem occurs in many algorithmic optimization problems. We can ask many questions related to this setup: for example, what is the probability that one bin has more than two balls in it (this is a version of the famous birthday problem!), what is the maximal number of balls in a bin, what is the number of empty bins, and so on…
Let us first compute the expected number of empty bins. We write N=X_1 + \cdots + X_n where X_i=1 if the i^{th} bin is empty and 0 otherwise. For the i^{th} bin to be empty, it must be missed by all m balls, and this occurs with probability \left(1-\frac{1}{n}\right)^m; thus E[N]= n\left(1-\frac{1}{n}\right)^m We prove a concentration bound around the mean using a martingale argument. We consider the random variables Y_i, where Y_i is the bin in which the i^{th} ball falls, and write N=N(Y_1, \cdots, Y_m). To apply Theorem 5.4 we check the bounded difference property. Consider how N changes when we change the location of the i^{th} ball: moving a single ball changes the number of empty bins by at most 1 (the old bin may become empty and the new bin may cease to be empty), so the bounded difference property holds with c_i=1. Then, by Theorem 5.4, we have P(|N - E[N]|\ge \epsilon)\le 2 e^{-\frac{\epsilon^2}{2m}}
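A small simulation (our sketch; n=1000 bins and m=1000 balls are chosen for illustration) shows the empty-bin count concentrating around E[N]=n(1-1/n)^m well within the McDiarmid bound:

```python
import math
import random

n, m, trials, eps = 1_000, 1_000, 5_000, 100
mean_th = n * (1 - 1 / n) ** m
exceed = 0
for _ in range(trials):
    occupied = [False] * n
    for _ in range(m):
        occupied[random.randrange(n)] = True   # throw one ball
    empty = n - sum(occupied)                  # N, the number of empty bins
    exceed += abs(empty - mean_th) >= eps

print("P(|N - E[N]| >= eps) ~", exceed / trials)
print("McDiarmid bound      :", 2 * math.exp(-eps**2 / (2 * m)))
```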
As another application of Azuma-Hoeffding's theorem we consider empirical processes, which are given by Z(X_1, \cdots, X_n)=\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - E[f(X_i)] \right| where the X_i are IID random variables and f belongs to a class of functions \mathcal{F}. We assume here that \sup_{x}|f(x)| \le B \text{ for all } f \in \mathcal{F} i.e. the functions f are uniformly bounded. One of the simplest examples of an empirical process is obtained by considering the functions f_t(x)=1_{\{x \le t\}}. By the LLN we have \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} \to P(X \le t)=F_X(t) \quad \text{ almost surely} where F_X is the distribution function of the RVs X_i. By the Glivenko–Cantelli theorem we have \sup_{t \in \mathbb{R} }\left| \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} - F_X(t)\right| \to 0 \quad \text{ almost surely} that is, we have uniform convergence of the empirical distribution to the true distribution.
Empirical processes satisfy the bounded difference property: indeed, if we set g(x_1, \cdots, x_n) = \left| \frac{1}{n}\sum_{i=1}^n f(x_i) - E[f(X)] \right| then g(x_1, \cdots, \tilde{x}_k, \cdots, x_n) = \left| \frac{1}{n}\sum_{i=1}^n f(x_i) - E[f(X)] + \frac{1}{n} (f(\tilde{x}_k) - f(x_k)) \right| \le g(x_1, \cdots, x_k, \cdots, x_n) + \frac{2B}{n} from which we conclude that Z(x_1, \cdots, \tilde{x}_k, \cdots, x_n) \le Z(x_1, \cdots, {x}_k, \cdots, x_n) +\frac{2B}{n}. Interchanging x_k and {\tilde x}_k proves the bounded difference property with c_k=\frac{2B}{n}. Then Theorem 5.4 shows that, for every \epsilon>0, P(|Z - E[Z]|> \epsilon) \le 2 e^{-\frac{n \epsilon^2}{8 B^2}} This tells us that the behavior of Z is controlled by its mean E[Z]. For example, if we set \delta=e^{-\frac{n \epsilon^2}{8 B^2}} we have |Z-E[Z]| \le 2B \sqrt{\frac{2}{n}\log \frac{1}{\delta}} \quad \text{ with probability at least } 1- 2 \delta for any \delta >0.
To go further we need to control the mean E[Z] and to do this we need two results.
Theorem 5.5 Suppose Y_1, Y_2, \cdots, Y_N are RVs (no independence needed) with E[e^{tY_i}]\le e^{\frac{t^2\sigma^2}{2}} for i=1,\cdots, N. Then E[ \max_{i} Y_i ] \le \sigma \sqrt{2 \ln(N)}
Proof. By Jensen inequality we have e^{t E[ \max_{i} Y_i ]} \le E[ e^{t \max_{i} Y_i } ] \le \sum_i E[e^{t Y_i}]\le N e^{\frac{t^2\sigma^2}{2}} Thus E[ \max_{i} Y_i ] \le \frac{\log{N}}{t} + \frac{t\sigma^2}{2} and optimizing over t yields the result. \quad \blacksquare.
Note that if the Y_i satisfy the condition of the theorem then so do the -Y_i, and thus we also have E[ \max_{i} |Y_i| ] = E[ \max_{i} \max\{ Y_i, -Y_i\} ] \le \sigma \sqrt{ 2 \log{(2 N)}} \le 2 \sigma \sqrt{ \log{N}} (the last inequality for N \ge 2).
Theorem 5.6 Suppose the \epsilon_i are IID Rademacher random variables (i.e. equal to \pm 1 with probability \frac{1}{2}). Then we have E\left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(X)]\right|\right] \le 2 E_{X} \left[ E_{\epsilon} \left[ \sup_{f}\left|\frac{1}{n} \sum_{i=1}^n \epsilon_{i} f(X_i)\right| \right]\right]
Proof. We use a duplication trick and consider RVs Y_1, \cdots, Y_n which are independent copies of X_1, \cdots, X_n. We then have
\begin{aligned}
E\left[ \sup_{f}| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(X_i)]|\right] &= E\left[ \sup_{f}| \frac{1}{n} \sum_{i=1}^nf(X_i)-E[f(Y_i)]|\right] \\
& = E_{X} \left[ \sup_{f}\left| E_{Y} [\frac{1}{n} \sum_{i=1}^nf(X_i)-f(Y_i)] \right| \right] \\
& \le E_{X} E_{Y} \left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^nf(X_i)-f(Y_i) \right|\right] \\
& \le E_{X} E_{Y} \left[ \sup_{f}\left| \frac{1}{n} \sum_{i=1}^n \epsilon_{i} (f(X_i)-f(Y_i)) \right|\right]
\end{aligned}
where in the last line we have used that multiplying (f(X_i)-f(Y_i)) by a factor -1 is equivalent to exchanging X_i and Y_i and thus does not change the expectation. Taking now expectation over \epsilon_i and using the triangle inequality yields the result.
This is very useful since we have gotten rid of E[f(X_i)], which may be unknown, and we now have a quantity which depends only on the data X_1, \cdots, X_n. How useful this can be is demonstrated next in the context of the Glivenko-Cantelli theorem.
Theorem 5.7 We have P\left( \sup_{t \in\mathbb{R}} \left| \frac{1}{n}\sum_{i=1}^n 1_{\{X_i\le t\}} - F_X(t) \right| \ge 8 \sqrt{ \frac{\log(n+1)}{n}} + t \right) \le e^{-\frac{n t^2}{8}}
Proof. Using Theorem 5.6, we bound E_{\epsilon}[ \sup_{t \in\mathbb{R}} | \frac{1}{n}\sum_{i=1}^n \epsilon_{i}1_{\{x_i\le t\}}|] for fixed values of the x_i. Given those values, and since f_t(x)=1_{\{x\le t\}} is the characteristic function of a half interval, as t varies there are only n+1 possible values for the vector (f_t(x_1), \cdots, f_t(x_n)) (to see this, order the x_i in increasing order). Therefore the supremum over t reduces to a maximum over n+1 values (which depend on x_1, \cdots, x_n).
Since the \epsilon_k satisfy the Gaussian bound E[e^{s \epsilon_k}]\le e^{\frac{s^2}{2}}, we have E_\epsilon[e^{ s\frac{1}{n} \sum_{k=1}^n \epsilon_{k} 1_{\{x_k\le t\}}}]= \prod_{k=1}^n E_\epsilon[e^{\frac{s}{n} \epsilon_k 1_{\{x_k\le t\}}} ] \le \prod_{k=1}^n e^{\frac{s^2}{2n^2}} = e^{\frac{s^2}{2n}} a bound which is independent of the x_i’s!
Thus by Theorem 5.5 (with \sigma^2 = \frac{1}{n}, applied to the 2(n+1) variables \pm \frac{1}{n}\sum_k \epsilon_k 1_{\{x_k \le t\}}) we find E_{\epsilon}[ \sup_{t \in\mathbb{R}} | \frac{1}{n}\sum_{i=1}^n \epsilon_{i}1_{\{x_i\le t\}}|] \le 2\sqrt{\frac{\ln(n+1)}{n}} Combining this with Theorem 5.6 gives E[Z] \le 4 \sqrt{\frac{\ln(n+1)}{n}} \le 8 \sqrt{\frac{\log(n+1)}{n}}, and the concentration bound P(Z \ge E[Z] + t) \le e^{-\frac{nt^2}{8}} obtained above finishes the proof. \quad \blacksquare
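Numerically the supremum in Theorem 5.7 is easy to evaluate (our sketch, with uniform X_i so that F(t)=t; the supremum is attained at the order statistics):

```python
import math
import random

n = 1_000
xs = sorted(random.random() for _ in range(n))   # X_i ~ Uniform[0,1]
# sup_t |F_n(t) - t| is attained just before or at an order statistic.
sup_dev = max(max(abs((i + 1) / n - x), abs(i / n - x))
              for i, x in enumerate(xs))
print("sup |F_n - F|      =", sup_dev)           # typically of order 1/sqrt(n)
print("8 sqrt(log(n+1)/n) =", 8 * math.sqrt(math.log(n + 1) / n))
```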
Consider a Markov chain X_j with state space S and transition matrix P, and consider a bounded function f: S \to \mathbb{R}.
We want to construct a martingale associated to the sequence of random variables f(X_j), j=0,1,2,\cdots.
If the f(X_j) were independent it would be enough to assume that they have mean 0.
In general we use the following idea. If we have a sequence of integrable RVs Y_0, Y_1, Y_2, \cdots then we set
D_m= Y_m - E[Y_m|Y_0, \cdots, Y_{m-1}]
and we then have E[D_m|Y_0, \cdots, Y_{m-1}]=0. So we can build a martingale by
M_n = \sum_{j=1}^n D_j = \sum_{j=1}^n (Y_j - E[Y_j|Y_0, \cdots, Y_{j-1}]) \,.
The same idea works for Markov chains: if f is bounded then, using the Markov property, D_m = f(X_m)- E[f(X_m)|X_0, \cdots, X_{m-1}] = f(X_m)- E[f(X_m)| X_{m-1}] = f(X_m) -Pf(X_{m-1}) is a martingale increment with respect to the sequence X_0, X_1, X_2, \cdots.
We obtain therefore the martingale \begin{aligned} M_n = \sum_{j=1}^n \left(f(X_j) -Pf(X_{j-1})\right) & = f(X_n) -f(X_0) + \sum_{j=0}^{n-1} (f(X_j)- Pf(X_j)) \\ & = f(X_n) -f(X_0) - \sum_{j=0}^{n-1} Af(X_j) \end{aligned} \tag{6.1} for f bounded, where we used the notation A=P - I. This martingale is called Dynkin's martingale.
If we want to apply martingale theory to analyze the sum \sum_{j=0}^{n-1} g(X_j), then we would like to use the martingale Equation 6.1. To do this we must find a function f such that Af = (P -I)f = - g Then we would have a martingale M_n = f(X_n) - f(X_0) + \sum_{j=0}^{n-1} g(X_j) \quad \quad \text{ with } Af = -g which, up to the correction term f(X_n) - f(X_0), is equal to \sum_{j=0}^{n-1} g(X_j).
In building a martingale from a Markov chain we have thus uncovered an important equation.
Definition 6.1 (Poisson equation) Let X_n be a Markov chain with transition matrix P. A function f satisfies the Poisson equation for the function g if Af = -g \quad \quad \text{where } \quad A =P - I
Note that the Poisson equation need not have a solution for arbitrary g. We investigate this in the context of finite state space Markov chains.
Theorem 6.1 Suppose X_n is an irreducible Markov chain with transition matrix P and stationary distribution \pi. Then the Poisson equation Af =-g has a solution if and only if g has mean 0 with respect to the stationary distribution, that is \pi g = \sum_{i} \pi(i) g(i) =0.
The function f is then given by f= (\Pi -(P-I))^{-1}g where \Pi is the matrix each of whose rows is the stationary distribution \pi.
Proof. By irreducibility the matrix P has a unique right eigenvector h=(1, \cdots, 1)^T (up to a multiplicative constant) and a unique left eigenvector \pi for the eigenvalue 1. The algebraic multiplicity of the eigenvalue 1 is also equal to 1, since (P-I)^2 g=0 implies (P-I)g=ch, i.e. Pg = g +ch, which in turn implies that \pi g= \pi Pg = \pi g + c, which is impossible unless c=0.
The matrix \Pi is a projection, \Pi^2=\Pi, and we have P \Pi = \Pi P = \Pi. That is, \Pi projects onto the eigenspace corresponding to the eigenvalue 1, and we have \Pi f = (\pi f) h. So if \pi f=0 then (P - \Pi)f= Pf. As a consequence, (P-\Pi)=P(I -\Pi) has the same eigenvalues as P except the eigenvalue 1, which is replaced by the eigenvalue 0, and thus I-(P-\Pi) is invertible.
If f solves the Poisson equation then (P-I)f = -g, and thus \pi P f - \pi f = \pi f -\pi f = 0 = -\pi g, which implies that \pi g=0.
Conversely, take g with \pi g =0 and let f= (\Pi -(P-I))^{-1}g. This implies that \Pi f -(P-I)f = g, which proves the claim if we can show that \Pi f =0. But since \Pi commutes with P we have \Pi f = \Pi (\Pi -(P-I))^{-1}g = (\Pi -(P-I))^{-1} \Pi g = 0 provided that \pi g=0. \quad \blacksquare
The object involved in solving the Poisson equation occurs in many different contexts and deserves a name.
Definition 6.2 (Fundamental matrix) For an irreducible finite state Markov chain X_n with transition matrix P and stationary distribution \pi, the fundamental matrix is given by Z = (I - (P-\Pi))^{-1} where \Pi is the matrix each of whose rows is the stationary distribution \pi.
The solution of the Poisson equation is then simply f=Zg.
Note that if X_n is irreducible and aperiodic we have proved that P^n-\Pi converges exponentially fast to 0. Using that (P-\Pi)^n =P^n - \Pi for n \ge 1, we then have, for any f with \pi f =0, the convergent series Z f = \sum_{n=0}^\infty P^n f\,.
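For a concrete check (our sketch, with an arbitrary 3-state chain), the fundamental matrix can be computed directly and f=Zg verified to solve the Poisson equation:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])            # an arbitrary irreducible chain

# Stationary distribution: left eigenvector of P for the eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

Pi = np.tile(pi, (3, 1))                   # every row equals pi
Z = np.linalg.inv(np.eye(3) - (P - Pi))    # fundamental matrix

g = np.array([1.0, -2.0, 0.5])
g -= pi @ g                                # center g so that pi g = 0
f = Z @ g                                  # candidate solution

print(np.allclose((P - np.eye(3)) @ f, -g))   # Poisson equation Af = -g: True
```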
We can now use this to prove a central limit theorem for Markov chain.
Theorem 6.2 (central limit theorem for Markov chains) For g: S \to \mathbb{R} with \pi g=0 and for any initial distribution of X_0, as n \to \infty, \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} g(X_j) converges in distribution to a mean zero normal random variable with variance \sigma^2(g) = \pi (f^2) - \pi((Pf)^2) where f is the solution of the Poisson equation (P-I)f = - g.
Proof. Consider the martingale M_n = f(X_n) - f(X_0) + \sum_{j=0}^{n-1} g (X_j) whose increments are D_j= f(X_j) - Pf(X_{j-1}).
We claim that E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] =1 \tag{6.2} Indeed, using the properties of conditional expectations we have \begin{aligned} & E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E \left[ E\left[ \frac{e^{i \theta M_n}}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \Big| \mathcal{F}_{n-1} \right]\right] = E \left[ \frac{E[e^{i \theta M_n}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] \\ & = E \left[ \frac{E[e^{i \theta (M_{n-1}+D_n)}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E \left[ \frac{ e^{i \theta M_{n-1}} E[e^{i \theta D_{n}}|\mathcal{F}_{n-1}]}{\prod_{j=1}^n E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] = E\left[ \frac{e^{i \theta M_{n-1}}}{\prod_{j=1}^{n-1} E[ e^{i \theta D_j} |\mathcal{F}_{j-1}] } \right] \\ \end{aligned} Iterating proves the claim. We now use a Taylor expansion to find \log E[e^{i\frac{\theta}{\sqrt{n}}D_j}|\mathcal{F}_{j-1}] = - \frac{\theta^2}{2n}E[D_j^2|\mathcal{F}_{j-1}] + \frac{1}{n^{3/2}} R_{j,n} where the R_{j,n} are uniformly bounded random variables (recall that f, and hence D_j, is bounded). Note that E[D_j^2|\mathcal{F}_{j-1}] = E[(f(X_j) - Pf(X_{j-1}))^2 | X_{j-1} ] = Pf^2(X_{j-1}) - (Pf(X_{j-1}))^2
Using the strong law of large numbers for Markov chains,
\lim_{n \to \infty }\sum_{j=1}^n \log E[e^{i\frac{\theta}{\sqrt{n}}D_j}|\mathcal{F}_{j-1}] = - \frac{\theta^2}{2} \pi (Pf^2 - (Pf)^2) \quad \text{ almost surely}
Therefore, from Equation 6.2, we find that
\lim_{n \to \infty} E\left[ e^{i \theta \frac{M_n}{\sqrt{n}}} \right] = e^{-\frac{\theta^2 \sigma^2(g)}{2}}
where
\sigma^2(g) = \pi (f^2 - (Pf)^2)
(note that \pi(Pf^2)=\pi(f^2) by stationarity).
Since \frac{1}{\sqrt{n}}M_n and \frac{1}{\sqrt{n}}\sum_{j=0}^{n-1} g(X_j) only differ by \frac{1}{\sqrt{n}}(f(X_n)-f(X_0)), which converges to 0, we obtain the central limit theorem. \quad \blacksquare
We can rewrite the variance in terms of the fundamental matrix Z=(I-(P-\Pi))^{-1}. Indeed since f = Z g and \pi f = \pi g=0 \begin{aligned} f^2 -(Pf)^2 & = (f- Pf)(f+ Pf) = (f -(P-\Pi)f)(f + (P-\Pi)f) \\ & = (I-(P-\Pi))f (I + (P-\Pi))f = (I-(P-\Pi))f (2I -(I-(P-\Pi)))f \\ & = 2 g Z g - g^2 \end{aligned}
Thus the asymptotic variance is given by
\sigma^2(g) = \pi ( 2 g Zg - g^2)
If X_n is aperiodic then, using Zg = \sum_{n=0}^\infty P^n g, we can rewrite this as
\sigma^2(g) = \pi \left(g^2 + 2\sum_{n=1}^\infty g\, P^n g\right)
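Continuing the numerical sketch above (same hypothetical 3-state chain as before), we can compute σ²(g)=π(2gZg - g²) and check it against the empirical variance of n^{-1/2} Σ_j g(X_j) over many runs:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))]); pi /= pi.sum()
Z = np.linalg.inv(np.eye(3) - (P - np.tile(pi, (3, 1))))
g = np.array([1.0, -2.0, 0.5]); g -= pi @ g      # center so that pi g = 0

sigma2 = pi @ (2 * g * (Z @ g) - g**2)           # asymptotic variance

n, reps, vals = 5_000, 200, []                   # Monte-Carlo check
for _ in range(reps):
    x, total = 0, 0.0
    for _ in range(n):
        x = rng.choice(3, p=P[x])
        total += g[x]
    vals.append(total / np.sqrt(n))

print("sigma^2(g) =", sigma2, "  simulated:", np.var(vals))
```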
In the context of Monte-Carlo methods one can try to use the central limit theorem as a way to compare different Markov chain Monte-Carlo methods used to sample the same distribution \pi. Indeed, the smaller the variance, the “better” the Monte-Carlo method will perform, since the fluctuations around the stationary value \pi g will be smaller.
The following theorem is a result in this direction. It shows that the more a Monte-Carlo Markov chain “jumps”, the smaller the asymptotic variance will be.
Theorem 6.3 Suppose P_1 and P_2 are two transition probabilities such that P_1 and P_2 satisfy detailed balance with respect to the stationary distribution \pi. Assume that P_1(i,j) \le P_2(i,j) \quad \text{for all } i \not= j Then for all g with \pi g=0 we have \sigma_{P_2}^2(g) \le \sigma_{P_1}^2(g) \,.
Proof. Suppose P satisfies the detailed balance condition. It is useful to consider the scalar product \langle f, g\rangle_{\pi} = \sum_{i} \pi(i) f(i) g(i)
As we have seen, detailed balance implies that P is self-adjoint with respect to this scalar product: \langle f, P g\rangle_{\pi} = \langle P f, g\rangle_{\pi} \,. Consider the so-called Dirichlet form (with A=P-I) \begin{aligned} \mathcal{E}(f,f) = \langle f, -Af \rangle_{\pi} = \sum_{i} \pi(i) f(i) \left(f(i) - \sum_{j}P(i,j)f(j) \right) = \sum_{i,j} \pi(i) f(i) P(i,j)(f(i) - f(j)) \end{aligned} Using detailed balance, \pi(i)P(i,j) = \pi(j) P(j,i), and then interchanging the indices, we find \begin{aligned} \mathcal{E}(f,f) = \sum_{i,j} \pi(j) f(i) P(j,i)(f(i) - f(j)) = - \sum_{i,j} \pi(i) f(j) P(i,j)(f(i) - f(j)) \end{aligned} Therefore, adding these two formulas, we find \mathcal{E}(f,f) = \frac{1}{2} \sum_{i\not= j} \pi(i) P(i,j) (f(i)-f(j))^2 Note that the Dirichlet form does not involve the diagonal terms P(i,i). Therefore if P_1(i,j) \le P_2(i,j) for i \not= j we have an inequality between the Dirichlet forms: \mathcal{E}_{P_1}(f,f) \le \mathcal{E}_{P_2}(f,f)
The variance \sigma^2(g) can be written in terms of this scalar product: \sigma^2(g) = 2\langle g, Zg\rangle_{\pi} - \langle g, g\rangle_{\pi}\,.